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Beating the Perils of Non-Convexity: 

Guaranteed Training of Neural Networks using Tensor Methods 
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Abstract 

Training neural networks is a challenging non-convex optimization problem, and backpropa- 
gation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm 
based on tensor decomposition for guaranteed training of two-layer neural networks. We pro¬ 
vide risk bounds for our proposed method, with a polynomial sample complexity in the relevant 
parameters, such as input dimension and number of neurons. While learning arbitrary tar¬ 
get functions is NP-hard, we provide transparent conditions on the function and the input for 
learnability. Our training method is based on tensor decomposition, which provably converges 
to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple 
embarrassingly parallel linear and multi-linear operations, and is competitive with standard 
stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a 
computationally efficient method with guaranteed risk bounds for training neural networks with 
one hidden layer. 

Keywords: neural networks, risk bound, method-of-moments, tensor decomposition 


1 Introduction 


Neural networks have revolutionized performance across multiple domains such as computer vision 
and speech recognition. They are flexible models trained to approximate any arbitrary target 
function, e.g., the label function for classification tasks. They are composed of multiple layers 
of neurons or activating functions, which are applied recursively on the input data, in order to 
predict the output. While neural networks have been extensively employed in practice, a complete 
theoretical understanding is currently lacking. 

Training a neural network can be framed as an optimization problem, where the network pa¬ 
rameters are chosen to minimize a given loss function, e.g., the quadratic loss function over the error 
in predicting the output. The performance of training algorithms is typically measured through 
the notion of risk, which is the expected loss function over unseen test data. A natural question to 
ask is the hydness of training a neural network with a bounded risk. The findings are mostly neg¬ 
ative (jRoiasl. 


1996 : Sima , 20021 : Blum and Rivest . 1993 : Bartlett and Ben-David . 1999 : Kuhlmarim 
2000). Training even a simple network is NP-hard, e.g., a network with a single neuron (jSimal . 


2003). 
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The computational hardness of training is due to the non-convexity of the loss function. In 
general, the loss function has many critical points, which include spurious local optima and saddle 
points. In addition, we face curse of dimensionality, and the number o f critical points grow s 


exponentially with the input dimension for general non-convex problems ( Dauphin et ah . 20141 ). 


Popular local search methods such as gradient descent or backpropagation can get stuck in bad local 
optima and experience arbitrarily slow convergence. Explicit examples of its failure an d the presence 


1989; 

Gori and Tesi. 

I992I: 

Frasconi et ah. 


of bad local optima in even simple separable se ttings have been documented before (jBradv et al. 

I 997 I ): see Section [6T] for a discussion. 


Alternative methods for training neural networks have been mostly limited to specific activa¬ 


tion functions (e.g., linear or quadratic), specific target functions (e.g., polynomials) (lAndoni et al.l. 
2014), or assume strong assumptions on the input (e.g., Gaussian or product distribution! (lAndoni et al 


2014l !. see related work for details. Thus, up until now, there is no unified framework for training 


networks with general input, output and activation functions, for which we can provide guaranteed 
risk bound. 

In this paper, for the first time, we present a guaranteed framework for learning general tar¬ 
get functions using neural networks, and simultaneously overcome computational, statistical, and 
approximation challenges. In other words, our method has a low computational and sample com¬ 
plexity, even as the dimension of the optimization grows, and in addition, can also handle approxi¬ 
mation errors, when the target function may not be generated by a given neural network. We prove 
a guaranteed risk bound for our proposed method. NP-hardness refers to the computational com¬ 
plexity of training worst-case instances. Instead, we provide transparent conditions on the target 
functions and the inputs for tractable learning. 

Our training method is based on the method of moments, which involves decomposing the em¬ 
pirical cross moment between output and some function of input. While pairwise moments are 
represented using a matrix, higher order moments require tensors, and the learning problem can 
be formulated as tensor decomposition. A CP (CanDecomp/Parafac) decomposition of a tensor in¬ 
volves finding a succinct sum of rank-one components that best fit the input tensor. Even though it 
is a non-convex problem, the global optimum of tensor decomposition can be ac hieved using compu 
tationally efficient techniques, under a set of mild non-degeneracy conditions (jAnandkumar et al. 


2m 4a 


ally 

30, 


20151 : iBhaskara et al.l . 120131 '). These methods have been recen tly employed for learning 


a wide range of latent variable models (jAnandkumar et al.l . l2014bl . l2013l ) . 


Incorporating tensor methods for training neural networks requires addressing a number of non¬ 
trivial questions: What form of moments are informative about network parameters? Earlier works 
using tensor methods for learning assume a linear relationship between the hidden and observed 
variables. However, neural networks possess non-linear activation functions. How do we adapt 
tensor methods for this setting? How do these methods behave in the presence of approximation 
and sample perturbations? How can we establish risk bounds? We address these questions shortly. 


1.1 Summary of Results 

The main contributions are: (a) we propose an efficient algorithm for training neural networks, 
termed as Neural Network-Learning using Feature Tensors (NN-LIFT), (b) we demonstrate that the 
method is embarrassingly parallel and is competitive with standard SGD in terms of computational 
complexity, and as a main result, (c) we establish that it has bounded risk, when the number of 
training samples scales polynomially in relevant parameters such as input dimension and number 
of neurons. 
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We analyze training of a two-layer feedforward neural network, where the second layer has 
a linea r activation fuiiction. This is the c^ssical neural network considered in a number of 
works ( Cvbenk3 . 1989b : Hornik . 1991 : Barron . 1994l b and a natural starting point for the analysis 
of any learning algorithm. Note that training even this two-layer network is non-convex, and finding 
a computationally efficient method with guaranteed risk bound has been an open problem up until 
now. 

At a high level, NN-LIFT estimates the weights of the first layer using tensor CP decomposition. 
It then uses these estimates to learn the bias parameter of first layer using a simple Fourier tech¬ 
nique, and finally estimates the parameters o f last layer using linear reg re sion. NN-LIFT consists 


of simple linear and multi-linear operations ( Anandkumar et ah . 2014b 3, 20151 ). Fourier analysis 


and ridge regression analysis, which are parallelizable to large-scale data sets. The computational 
complexity is comparable to that of the standard SGD; in fact, the parallel time complexity for 
both the methods is in the same order, and our method requires more processors than SGD by a 
multiplicative factor that scales linearly in the input dimension. 


Generative vs. discriminative models: Generative models incorporate a joint distribution 
p{x, y) over both the input x and label y. On the other hand, discriminative models such as neural 
networks only incorporate the conditional distribution p{y\x). While training neural networks 
for general input x is NP-hard, does knowledge about the input distribution p(x) make 
learning tractable? 

In this work, we assume knowledge of the input densit y ip(x). which can b e any continuous 
differentiable function. Unlike many theoretical works, e.g., (|Andoni et ahl . l2014l h we do not limit 
ourselves to distributions such as product or Gaussian distributions for the input. While unsu¬ 
pervised learning, i.e., estimation of density p{x), is itself a hard problem for general models, in 
this work, we investigate how p{x) can be exploited to make training of neural networks tractable. 
The knowledge of p{x) is naturally available in the experimental design framework, where the per¬ 
son designing the experiments has the ability to choose the input distribution. Examples include 
conducting polling, carrying out drug trials, collecting survey information, and so on. 


Utilizing generative models on the input via score functions: We utilize the knowledge 
about the input density p{x) (up to normalization)0 to obtain certain (non-linear) transformations 
of the input, given by the class of score functions. Score functions are normalized derivatives of 
the input pdf; see ([8]). If the input is a vector (the typical case), the first order score function (i.e., 
the first derivative) is a vector, the second order score is a matrix, and the higher order scores are 
tensors. In our NN-LIFT method, we first estimate the cross-moments between the output and the 
input score functions, and then decompose it to rank-1 components. 


Risk bounds: Risk bound includes both approximation and estimation errors. The approxima¬ 
tion error is the error in fitting the target function to a neural network of given architecture, and 
the estimation error is the error in estimating the weights of that neural network using the given 
samples. 

We first consider the realizable setting where the target function is generated by a two-layer 
neural network (with hidden layer of neurons consisting of any general sigmoidal activations), and 

^We d o not require the knowledge o f the normalizing constant or the partition function, which is hard to 
compute dWainwright and Jordanl . LoOSl i. 
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a linear output layer. Note that the approximation error is zero in this setting. Let Ai G be 
the weight matrix of first layer (connecting the input to the neurons) with k denoting the number 
of neurons and d denoting the input dimension. Suppose these weight vectors are non-degenerate, 
i.e., the weight matrix Ai (or its tensorization) is full column rank. We assume continuous input 
distribution with access to score functions, which are bounded on any set of non-zero measure. We 
allow for any general sigmoidal activation functions with non-zero third derivatives in expectation, 
and satisfying Lipschitz property. Let Smin(') be the minimum singular value operator, and M-i{x) € 
denote the matricization of input score function tensor S2,{x) G see m and dl]) for 

the definitions. For the Gaussian input x ~ Af{ 0 ,ld), we have E [||M3(x)M3'(x)||] = O (d^). We 
have the following learning result in the realizable setting where the target function is generated 
by a two layer neural network (with one hidden layer). 

Theorem 1 (Informal result for realizable setting). Our method NN-LIFT learns a realizable target 
function up to error e when the number of samples is lower bounded ajl, 


n> O 


•E 


M3ix)Mjix 


gmax(^l) A 

4in(^l)/ 


Thus, we can efficiently learn the neural network parameters with polynomial sample complex¬ 
ity using NN-LIFT algorithm. In addition, the method has polynomial computational complexity, 
and in fact, its parallel time complexity is the same as stochastic gradient descent (SGD) or back- 
propagation. See Theorem [3] for the formal result. 

We then extend our results to the non-realizable setting where the target function need not be 
generated by a neural network. For our method NN-LIFT to succeed, we require the approximation 
error to be sufficiently small under the given network architecture. Note that it is not of practical 
interest to consid er functions wi th large approximation errors, since classification performance in 
that case is poor ( Bartlett . 19981 ). We state the informal version of the result as follows. 

We assume the following: the target function f(x) has a continuous Fourier spectrum and is 
sufficiently smooth, i.e., the parameter Cf (see (|13p for the definition) is sufficiently small as specified 
in (I18|) . This implies that the approximation error of the target function can be controlled, i.e., there 
exists a neural network of given size that can fit the target function with bounded approximation 
error. Let the input x be bounded as ||x|| < r. Our informal result is as follows. See Theorem [5] 
for the formal result. 


Theorem 2 (Informal result for non-realizable setting). The arbitrary target function f{x) is 
approximated by the neural network f{x) which is learnt using NN-LIFT algorithm such that the 
risk bound satisfies w.h.p. 

IEx[|/(x) - /(x)P] < 0{r^C}) • + dl) + 0(e2), 

where k is the number of neurons in the neural network, and 5r is defined in m- 

In the above bound, we require for the target function f{x) to have bounded first order moment 
in the Fourier spectrum; see (IlSp . As an example, we show that this bound is satisfied for the class 
of scale and location mixtures of the Gaussian kernel function. 

^Here, only the dominant terms in the sample complexity are noted; see Gil) for the full details. 
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Corollary 1 (Learning mixtures of Gaussian kernels). Let f{x) := f K(a(x+/3))G(da,d/3), a > 0, 
13 G be a location and scale mixture of the Gaussian kernel function K{x) = exp the 

input be Gaussian as x ^ Af{0,a‘^ld), and the activations be step functions, then, our algorithm 
trains a neural network with risk bounds as in Theorem d when 

j |a| • \G\{da,df3) < poly Q, ^e, ^,exp (-1/cr^)^ . 

We observe that when the kernel mixtures correspond to smoother functions (smaller a), the 
above bound is more likely to be satisfied. This is intuitive since smoother functions have lower 
amount of high frequency content. Also, notice that the above bound has a dependence on the 
variance of the Gaussian input ax- We obtain the most relaxed bound (r.h.s. of above bound) for 
middle values of ax, i.e., when ax is neither too large nor too small. See Appendix ID . 1 1 for more 
discussion and the proof of the corollary. 


Intuitions behind the conditions for the risk bound: Since there exist worst-case instances 
where learning is hard, it is natural to expect that NN-LIFT has guarantees only when certain 
conditions are met. We assume that the input has a regular continuous probability density 
function (pdf); see (llip for the details. This is a reasonable assumption, since under Boolean 
inputs (a speci al case of discrete i n put), it reduces to learning parity with noise which is a 
hard problem ([Kearns and Vaziranil . ll994l L We assume that the activating functions are suf- 
ficient ly non-linear, sin c e if t hey are linear, then the network can be collapsed into a single 
layer ( Baldi and Hornik . 19891 ). which is non-identihable. We precisely characterize how the es¬ 
timation error depends on the non-linearity of the activating function through its third order 
derivative. 

Another condition for providing the risk bound is non-redundancy of the neurons. If the neurons 
are redundant, it is an over-specihed network. In the realizable setting, where the target function is 
generated by a neural network with the given number of neurons k, we require (tensorizations of) 
the weights of hrst layer to be linearly independent. In the non-realizable setting, we require this to 
be satisfied by k vectors randomly drawn from the Fourier magnitude distribution (weighted by the 
norm of frequency vector) of the target function f{x). More precisely, the random frequencies are 
drawn from probability distribution A(a;) := ||w|| • \F{u})\/Gf where F(u}) is the Fourier transform of 
arbitrary function f{x), and Gf is the normalization factor; see (1261) and corresponding discussions 
for more details. This is a mild condition which holds when the distribution is continuous in some 
domain. Thus, our conditions for achieving bounded risk are mild and encompass a large class of 
target functions and input distributions. 


Why tensors are required? We employ the cross-moment tensor which encodes the correlation 
between the third order score function and the output. We then decompose the moment tensor 
as a sum of rank-1 components to yield the weight vectors of the first layer. We require at least 
a third order tensor to learn the neural network weights for the following reasons: while a ma¬ 
trix decomposition is only identifiable up to orthogonal components, tensors can have identihable 
non-orthogonal components. In general, it is not realistic to assume that the weight vectors are 
orthogonal, and hence, we require tensors to learn the weight vectors. Moreover, through ten¬ 
sors, we can learn overcomplete networks, where the number of hidden neurons can exceed the 
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input/output dimensions. Note that matrix factorization methods are unable to learn overcom¬ 
plete models, since the rank of the matrix cannot exceed its dimensions. Thus, it is critical to 
incorporate tensors for training neural networks. A recent set of papers have analyzed the tensor 


metho ds i n detaih and established co nvergence and perturbation guarantees (jAnandkumar et al. 


2014al jbljd. I 2 OI 5 I : iBhaskara et al.l . l2013l ). despite non-convexity of the decomposition problem. Such 


strong theoretical guarantees are essential for deriving provable risk bounds for NN-LIFT. 

Extensions: Our algorithm NN-LIFT can be extended to more layers, by recursively estimating 

the weights layer by layer. In principle, our analysis can be extended by controlling the perturbation 
introduced due to layer-by-layer estimation. Establishing precise guarantees is an exciting open 
problem. 

In this work, we assume knowledge of the generative model for the input. As argued before, 
in many settings such as experimental design or polling, the design of the input pdf p{x) is under 
the control of the learner. Even if p{x) is not known, a recent flurry of research activity has 
shown that a wide class of probabilistic models c an be trained consisten t ly usi ng a suite of different 
efficient a lgorithms: convex re l axation methods ( Chandrasekaran et ah . 201 oIL spectral and tensor 
methods ( Anandkumar et ah . 2ni4bl ). alternating minimization ( Agarwal et ah . 20141 ) . and they 
require only polynomial sample and computational complexity, with respect to the input and hidden 
dimensions. These methods can learn a rich class of models which also includes latent or hidden 
variable models. 

Another aspect not addressed in this work is the issue of regularization for our NN-LIFT 
algorithm. In this work, we assume that the number of neurons is chosen appropriately to balance 
bias an d variance through cross validation. Designing implicit regulariz ation methods such as 


dropout ( Hinton et ah . 2012l i or early stopping ( Morgan and Bourlard . 1989l i for tensor factorization 


and analyzing them rigorously is another exciting open research problem. 

1.2 Related works 

We first review some works regarding the analysis of backpropagation, and then provide some 
theoretical results on training neural networks. 


Analysis of backpropagation and loss snrface of optimization: Baldi and HornikI (Il989l ) 


show that if the activations are linear, then backpropagation has a unique local optimum, and 
it corresponds to the principal components of the covariance matrix of the training examples. 
However, it is kno wn that there e xist n etworks with non-linear activations where backpropagation 
fails; for instance, Bradv et al. ( 1989l i construct simple cases of linearly separable classes that 
backpropaga tion fails. Note that the simple perceptron algorithm will succeed here due to linear 


separability. iGori and Tesil (jl992l l argue that such examples are artificial and that backpropagation 


succeeds in reaching the global optimum for linearly separable classes in practical settings. However, 
they show that unde r non-linear separabil ity, backpropagation can get stuck in local optima. For 
a detailed su rvey, see Frasconi et al.l (Il997 i. 

Recently, Choromanska et al. ( 2015l i analyze the loss surface of a multi-layer ReLU network by 
relating it to a spin glass system. They make several assumptions such as variable independence 
for the input, equally likely paths from input to output, redundancy in network parameterization 
and uniform distribution for unique weights, which are far from realistic. Under these assumptions. 
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the network reduces to a random spin glass models where it is known that the lowest critical values 
of the random loss function form a layered structure, an d the number of local minima outside 


that band diminishes exponentially with the network size ( Auffinger et ah . 2013l i. However, this 


does not imply computational efficiency: there is no guarantee that we can find such a good local 
optimal point using computationally cheap algorithms, since there are still exponential number of 
suc h points. _ 


Haeffele and Vidall (|2015l i provide a general framework for characterizing when local optima 


become global in deep learning and other scenarios. The idea is that if the network is sufficiently 
overspecihed (i.e., has enough hidden neurons) such that there exist local optima where some of the 
neurons have zero contribution, then such local optima are in fact, global. This provides a simple 
and a unified characterization of local optima which are global. However, in general, it is not clear 
how to design algorithms that can reach these efficient optimal points. 


Previous theoretical works for training neural networks: Analysis of risk for neural net¬ 
works is a classical pro blem. Approxima t ion error of two layer n eural net work ha s been analyzed 
in a number of works ( Cvbenkq . 1989b : Hornik . 1991 : Barron . 19941 ). Barron ( 19941 ) provides 
a bound on the approximation error and combines it with the estimation error to obtain a risk 
bound, but for a computationally ineffic ient me t hod. The sample com plexity for neural net¬ 
works have been extensively analyzed in Barron ( 1994 ): Bartlett fll998). assuming conv e rgenc e 
to the globally optimal solution , whic h in general is intractable. See Anthony and Bartlett ( 20091 ): 


Sha 


ev-Sh wartz and Ben-David ( 20141 ) for an exposition of classical results on neural networks. 


Andoni et al.l ( 20141 ) learn polynomial target functions using a two-layer neural network under 


Gaussian/uniform input distribution. They argue that the weights for the hrst layer can be selected 
randomly, and only the second layer weights, which are linear, need to be fitted optimally. However, 
in practice. Gaussian/uniform distributions are never encountered in classihcation problems. For 
general distributions, random weights in the first layer is no t sufficient. Under our framework, we 
impose only mild non-degeneracy conditions on the weights. Livni et al. ( 20141 ) make the observa¬ 
tion that networks with quadratic activation functions can be trained in a computationally efficient 
manner in an incremental manner. This is because with quadratic activations, greedily adding one 
neuron at a time can be solved efficiently through eigen decomposition. However, the standard sig¬ 
moidal networks require a large depth polynomial n etwork, which is not practical. After we posted 
the initial version of this paper, IZhang et al.l (j2015|) extended this framework to improper learning 
scenario, where the output predictor need not be a neural network. They show that if the norm 
of the incoming weights in each layer is bounded, then learning is efficient. However, for the usual 
neural networks with sigmoidal activations, the norm o f the weights scales with dimension and 
in this case, the algorithm is no longer polynomial time. lArora et al.l (j2013l ) provide bounds for 
leaning a class of deep representations. They use layer-wise learning where the neural network is 
learned layer-by-layer in an unsupervised manner. They assume sparse edges with random bounded 
weights, and 0/1 threshold functions in hidden nodes. The difference is here, we are considering 
the supervised setting where there is both input and output, and we allow for general sigmoidal 
functions at the hidden neurons. 


Recently, after posting the initial version of this paper, iHardt et al.l (j2015l ) provided an analysis 
of stochastic gradient descent and its generalization error in convex and non-convex problems such 
as training neural networks. They show that the generalization error can be controlled under mild 
conditions. However, their work does not address about reaching a solution with small risk bound 
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using SGD, and the SGD in general can get stuck in a spurious local optima. On the other hand, 
we show that in addition to having a small generalization error, our method yields a neural network 
with a small risk bound. Note that our method is moment-based estimation, and these methods 
come with stability bounds that guarantee good generalization e rror. 

Closely related to this paper. ISedghi and Anandkumaii (j2014l l consider learning neural networks 
with sparse connectivity. They employ the cross-moment between the (multi-class) label and (first 
order) score function of the input. They show that they can provably learn the weights of the first 
layer, as long as the weights are sparse enough, and there are enough number of input dimensions 
and output classes (at least linear up to log factor in the number of neurons in any layer). In this 
paper, we remove these restrictions and allow for the output to be just binary class (and indeed, 
our framework applies for multi-class setting as well, since the amount of information increases 
with more label classes from the algorithmic perspective), and for the number of neurons to exceed 
the input/output dimensions (overcomplete setting). Moreover, we extend beyond the realizable 
setting, and do not require the target functions to be generated from the class of neural networks 
under consideration. 


2 Preliminaries and Problem Formulation 

We first introduce some notations and then propose the problem formulation. 

2.1 Notation 

Let [n] := {1,2,... ,n}, and ||rt|| denote the £2 or Euclidean norm of vector u, and {u,v) denote 
the inner product of vectors u and v. For matrix C G the j-th column is referred by Cj or 

Cj, j G [k]. Throughout this paper, denotes the m-th order derivative operator w.r.t. variable 
X. For matrices A,B^ the Khatri-Rao product C := AQ B ^ is dehned such that 

C{1 + {i- l)d,j) = A{i,j) ■ B{l,j), for i,l G [d],j G [k]. 

Tensor: A real m-th order tensor T G is a member of the outer product of Euclidean 

spaces The different dimensions of the tensor are referred to as modes. For instance, for a 
matrix, the first mode refers to columns and the second mode refers to rows. 

Tensor matricization: For a third order tensor T G the matricized version along first 

mode denoted by M G is dehned such that 

T{i,j,l)=M{i,l-^{j-l)d), i,j,le[d]. (1) 

Tensor rank: A 3rd order tensor T G is said to be rank-1 if it can be written in the form 

T = w- a^b^c4^ T{iij, 1) = w ■ a{i) ■ b{j) ■ c{l), (2) 

where (8> represents the outer product, and a,b,c G are unit vectors. A tensor T G is 

said to have a CP (Candecomp/Parafac) rank k if it can be (minimally) written as the sum of k 
rank-1 tensors 


T = ^ Wiai ®bi® Ci, 


Wi G M, Oj, bi, Ci G 


(3) 





Derivative: For function g{x) : —)• M with vector input x G the m-th order derivative 

w.r.t. variable x is denoted by G (which is a m-th order tensor) such that 




dg{x) 


J 21, 


dxi^dxi^ ■ ■ ■ dxi^ 


zi,. 


G[d]. 


(4) 


When it is clear from the context, we drop the subscript x and write the derivative as V^'^'^g{x) 


Fourier transform: For a function f{x) : — >■ M, the multivariate Fourier transform F{uj) : 

> M is defined as 

F{oj) := [ f{x)e-^^^'^Ux, (5) 

jRd- 

where variable w G is called the frequency variable, and j denotes the imaginary unit. We also 
denote the Fourier pair {f{x),F{uj)) as f{x) F{(jj). 


Function notations: Throughout the paper, we use the following convention to distinguish 
different types of functions. We use /(x) (or y) to denote an arbitrary function and exploit f{x) 
(or y) to denote the output of a realizable neural network. This helps us to differentiate between 
them. We also use notation f{x) (or y) to denote the estimated (trained) neural networks using 
finite number of samples. 


2.2 Problem formulation 


We now introduce the problem of training a neural network in realizable and non-realizable settings, 
and elaborate on the notion of risk bound on how the trained neural network approximates an 
arbitrary function. It is known that continuous functions with compact domain can be arbitrarily 
well appro ximated by feedforward neural networks with one h idden layer and sigmoidal nonlinear 
functions ( Cvbenk3 . 1989a : Hornik et ah . 19891 : Barron . 1993l l. 

The input (feature) is denoted by variable x, and output (label) is denoted by variable y. We 
assume the input and output are generated according to some joint density function p(x, y) such 
that {xi,yi) p{x,y), where {xi,yi) denotes the z-th sample. We assume knowledge of the input 
density p{x), and demonstrate how it can be used to train a neural network to approximate the 
conditional density p{y\x) in a computationally efficient manner. We discuss in Section [3.II how the 
input density p{x) can be estimated through numerous methods such as score matching or spectral 
methods. In settings such as experimental design, the input density p{x) is known to the learner 
since she designs the density function, and our framework is directly applicable there. In addition, 
we do not need to know the normalization factor or the partition function of the input distribution 
p{x), and the estimation up to normalization factor suffices. 


Risk bound: In this paper, we propose a new algorithm for training neural networks and provide 
risk bounds with respect to an arbitrary target function. Risk is the expected loss over the joint 
probability density function of input x and output y. Here, we consider the squared I 2 loss and 
bound the risk error 




( 6 ) 
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Figure 1: Graphical representation of a neural network, E[y|a;] = Aja{Ajx + bi) + 62 . Note that this 
representation is for general vector output y which can be also written as E[j/|a;] = {a 2 ,o'{Alx + hi)) + &2 in 
the case of scalar output y. 

where f{x) is an arbitrary function which we want to approximate by f{x) denoting the estimated 
(trained) neural network. This notion of risk is also called mean integrated squared error. The 
proposed risk error for a neural network consists of two parts: approximation error and estimation 
error. Approximation error is the error in fitting the target function f{x) to a neural network with 
the given architecture /(x), and estimation error is the error in training that network with finite 
number of samples denoted by f{x). Thus, the risk error measures the ability of the trained neural 
network to generalize to new data generated by function f{x). We now introduce the realizable 
and non-realizable settings, which elaborates more these sources of error. 

2.2.1 Realizable setting 

In the realizable setting, the output is generated by a neural network. We consider a neural network 
with one hidden layer of dimension k. Let the output y G {0,1} be the binary label, and x G 
be the feature vector; see Section (6.21 for generalization to higher dimensional output (multi-label 
and multi-class), and also the continuous output case. We consider the label generating model 

f{x) := E[y\x] = (a2,(T(Aj x h)) + 62 , (7) 

where it(-) is (linear/nonlinear) elementwise function. See Figured] for a schematic representation 
of label-function in ([7|) in the general case of vector output y. 

In the realizable setting, the goal is to train the neural network in ([7|), i.e., to learn the weight 
matrices (vectors) Ai G 02 G and bias vectors 61 G 62 G This only involves 

the estimation analysis where we have a label-function f{x) specified in ([7|) with fixed unknown 
parameters Ai,bi,a 2 ,b 2 , and the goal is to learn these parameters and finally bound the overall 
function estimation error E 3 ;[|/(x) — /(x)p], where f{x) is the estimation of fixed neural network 
f{x) given finite samples. For this task, we propose a computationally efficient algorithm which 
requires only polynomial number of samples for bounded estimation error. This is the first time 
that such a result has been established for any neural network. 
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Algorithm 1 NN-LIFT (Neural Network Learning using Feature Tensors) 
input Labeled samples {{xi,yi) : i E [n]}, parameter ei, parameter A. 
input Third order score function S‘i{x) of the input x; see Equation ([ 8 ]) for the definition. 
1 : Compute T := i 

2 : jg[fc] = tensor decomposition(T); see Section and Appendix iBl for details. 

3: bi = Fourier method({(xi,: i E [n]}, Ai,ei); see Procedure[2j 
4 : (02,62) = Ridge regression({(xi, yi) : i E [n]}, Ai, 61, A); see Procedure[3l 
5 : return Ai, 02, 61,62. 


2.2.2 Non-realizable setting 


In the non-realizable setting, the output is an arbitrary function f{x) which is not necessarily a 
neural network. We want to approximate f{x) by f{x) denoting the estimated (trained) neural 
network. In this setting, the additional approximation analysis is also required. In this paper, we 
combine the estimation result in realizable setting with the approximation bounds in Barron ( 19931 ) 
leading to risk bounds with respect to the target function /(x); see ([HD for the definition of risk. 
The detailed results are provided in Section [5l 


3 NN-LIFT Algorithm 

In this section, we introduce our proposed method for learning neural networks using tensor, Fourier 
and regression techniques. Our method is shown in Algorithm [1] named NN-LIFT (Neural Network 
Learning using Feature Tensors). The algorithm has three main components. The first component 
involves estimating the weight matrix of the first layer denoted by Ai E by a tensor decompo¬ 
sition method. The second component involves estimating the bias vector of the first layer hi E 
by a Fourier method. We finally estimate the parameters of last layer 02 E and 62 £ by linear 
regression. 

Note that most of the unknown parameters (compare the dimensions of matrix Ai, vectors 02 , 
61, and scalar 62) are estimated in the first part, and thus, this is the main part of the algorithm. 
Given this fact, we also provide an alternative method for the estimation of other parameters of 
the model, given the estimate of Ai from the tensor method. This is based on incrementally adding 
neurons, one by one, whose first layer weights are given by Ai and the remaining parameters 
are updated using brute force search on a grid. Since each update involves just updating the 
corresponding bias term bi, and its contribution to the final output, this is low dimensional, and 
can be done efficiently; details are in Section 16.31 

We now explain the steps of Algorithm [T] in more details. 

3.1 Score function 

The m-th order score function Sm{x) E is defined as 

( 8 ) 

p[x) 

where p{x) is the probability density function of random vector x E In addition, denotes 
the m-th order derivative operator; see (UD for the precise definition. The main property of score 


(Janzamin et ah. 2014) 


11 





















functions as yielding differential operators that enables us to estimate the weight matrix Ai via 
tensor decomposition is discussed in the next snbsection; see Equation ([9]). 

In this paper, we assnme access to a sufficiently good approximation of the input pdf p{x) and 
the corresponding score fnnctions 52(x), S 3 {x). Indeed, estimating these quantities in general is a 
hard problem, but there exist numerous instances where this becomes tractable. Examples include 
spectral methods for learning latent variable models such as Gaussian mixtures, topic o r admixture 
models, independent component analysis (ICA) and so on (lAnandkumar et al.l. l2014bll. Moreover , 


there have been recent advances in non-parametric score matching methods (|Sriperumbndnr et al 


2 OI 3 I I for density estimation in infinite dimensional exponential families with guaranteed conver¬ 
gence rates. These methods can be used to estimate the input pdf in an unsupervised manner. 
Below, we discuss in detail about score function estimation methods. In this paper, we focus on 
how we can use the input generative information to make training of nenral networks tractable. 
Eor simplicity, in the snbseqnent analysis, we assnme that these quantities are perfectly known; it 
is possible to extend the pertnrbation analysis to take into account the errors in estimating the 
inpnt pdf; see Remark [H 


Estimation of score function: There are various efficient methods for estimating the score 
function . The frarnework of score matching is popnlar for parameter estimation in probabilistic 
models (jHvvarinenl . l2005l : ISwerskv et al.l. 120111). where th e criterion is to fit parameters based on 
matching the data score fnnction. ISwerskv et al.l (120111 ) analyze the score matching for latent 
energy-based models. In deep learning, the framework of auto-encoders attempts to find encoding 
and decoding functions which minimize the reconstrnction error under added noise; the so-called 
Denoisin g Anto-Encoders (DAE) . This is an unsupervised framework involving only nnlabeled 
samples. Alain and Bengi^ ( 20121 ) argue that the DAE app roximately learns the fi rst or der score 


function of the inpnt, as the noise variance goes to zero. Sripernmbudur et al. ( 2013ll 


propose 


non-parametric score matching methods for density estimation in infinite dimensional exponen¬ 
tial families with guaranteed convergence ra tes. Therefore, we can use any of these methods for 
estimating 5i(x) and nse the recnrsive form ( Janzamin et ah . 2014l l 


5m(x) = -5m_i(a:) (g) logp(x) - Va;5m-i(a;) 


to estimate higher order score fnnctions. 


3.2 Tensor decomposition 

The score fnnctions are new representations (extracted features) of input data x that can be used for 
training neural networks. We use score functions and labels of training data to form the empirical 
cross-moment T = ^ Z]ie[n] Vi We decompose tensor T to estimate the columns of Ai. The 

following discussion reveals why tensor decomposition is relevant for this task. 

The score functions have the property of yielding differenti al operators with respec t to the inpnt 
distribntion. More precisely, for label-function f{x) := E [nix]. I Janzamin et al.l (|2014l l show that 


E[y53(x)]=E[Vi3)/(x)]. 

Now for the neural network ontput in ([7]), note that the function /(x) is a non-linear function of 
both input x and weight matrix Ai. The expectation operator E[-] averages out the dependency on 
X, and the derivative acts as a linearization operator as follows. In the neural network output ([7]), 
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we observe that the columns of weight vector Ai are the linear coefficients involved with input 
variable x. When taking the derivative of this function, by the chain rule, these linear coefficients 
shows up in the final form. In particular, we show in Lemma [6] (see Section [7]) that for neural 
network in ([7]), we have 

E [y • 53 (x)] = a, • ® ® E (9) 

ie[fc] 

where {Ai)j E denotes the j-th column of j4i, and Xj E M denotes the coefficient; refer to 
Equation ([3]) for the notion of tensor rank and its rank-1 components. This clarifies how the score 
function acts as a linearization operator while the final output is nonlinear in terms of Ai. The 
above form also clarifies the reason behind using tensor decomposition in the learning framework. 


Tensor decomposition algorithm: The goal of tensor decomposition algorithm is to recover 
the rank- 1 components of tensor. Fo r this step, we use the tensor decomposition algorithm pro¬ 


posed in ( Anandkumar et ah . 2ni4bl ^): see Appendix [B] for details. The main step of the tensor 


decomposition method is the tensor power iteration which is the generalization of matrix power 
iteration to 3rd order tensors. The tensor power iteration is given by 


u 


T{I,u,u) 

\\T{I,u,u)\ 


where u E M'^, T{I, u, u) := Ylj i^[d] j, 1) E is a multilinear combination of tensor /ifeersH 

The convergence guarantees of t ensor power iterat i on for orthogonal tensor decomp osition have 


been developed in the literature ( Zhang and Golub . 2001 : Anandkumar et ah . 2014bl i. Note that 


we first orthogonalize the tensor via whitening procedure and then apply the tensor power iteration. 
Thus, the original tensor decomposition need not to be orthogonal. 


Computational Complexity: It is popular to perform the tensor decomposition in a stochastic 
manner which reduces the computational complexity. This is done by splitting the data into mini¬ 
batches. Starting with the first mini-batch, we perform a small number of tensor power iterations, 
and then use the result as initialization for the next mini-batch, and so on. As mentioned earlier, we 
assume that a sufficiently good approximation of score function tensor is given to us. For specific 
cases where we have this tensor in factor form, we can reduce the computational complexity of 
NN-LIFT by not computing the whole tensor explicitly. By having factor form, we mean when we 
can write the score function tensor in terms of summation of rank-1 components which could be 
the summation over samples, or from other existing structures in the model. We now state a few 
examples when we have the factor form, and provide the computational complexity. For example, 
if input follows Gaussian distribution, the score function has a simple polynomial form, and the 
computational complexity of tensor decomposition is 0{nkdR), where n is the number of samples 
and R is the number of initializations for the tensor decomposition. Similar argument follows when 
the input distribution is mixture of Gaussian distributions. 

We can also analyze complexity for more complex inputs. If we fit the input data into a 
Restricted Boltzmann Machines (RBM) model, the computational complexity of our method is 

^Tensor fibers are the vectors which are derived by fixing all the indices of the tensor except one of them, e.g., 
T{:,j,l) in the above expansion. 
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Procedure 2 Fourier method for estimating bi 

input Labeled samples {{xi,yi) : i E [n]}, estimate Ai, parameter ei. 
input Probability density function p{x) of the input x. 

1: for / = 1 to A: do ^ 

2: Let ill := |a; E : ||a;|| = |(a;, (^i)z)| > —and Ifl;! denotes the surface area of d—1 

dimensional manifold O/. 

3 : Draw n i.i.d. random frequencies u}i,i E [n], uniformly from set D;. 

4: Let V := ^ X^ie[n] which is a complex number as v = The operators | • | 

and Z- respectively denote the magnitude and phase operators. 

5: Let bi(l) := ^(Zu — ZS(l/2)), where a(x) S(a;). 

6: end for 
7: return 6i. 


0{nkddhR). Here d^ is the number of neurons of the first layer of the RBM used for approximating 
the input distribution. Tensor methods are also embarrassingly parallelizable. When performed in 
parallel, the computational complexity would be 0(log(min{d, dh})) with 0{nkddhR/ l ogfminfd, dh ) )) 
proce ssors. Alternatively, we can also exploit recent t ensor sketch i ng ap proaches (jWang et al.l . 


2015) for computing tensor decompositions efficiently. Wang et al. ( 2015l i build on the idea of 


count sketches and show that the running time is linear in the input dimension and the number 
of samples, and is independent in the order of the tensor. Thus, tensor decompositions can be 
computed efficiently. 


3.3 Fourier method 

The second part of the algorithm estimates the first layer bias vector bi E This step is 

very different from the previous step for estimating Ai which was based on tensor decomposition 
methods. This is a Fourier-based method where complex variables are formed using labeled data 
and random frequencies in the Fourier domain; see Procedure [2l We prove in Lemma [7] that 
the entries of bi can be estimated from the phase of these complex numbers. We also observe in 
Lemma [7] that the magnitude of these complex numbers can be used to estimate 02 ; this is discussed 
in Appendix 1C. 2 1 


Polynomial-time random draw from set D;: Note that the random frequencies are drawn 
from a d — 1 dimensional manifold denoted by fli which is the intersection of sphere ||cu|| = ^ 

and cone |(a;, (Ai)i)| > ^ in This manifold is actually the surface of a spherical cap. 
In order to draw these frequencies in polynomial time, we consider the d-dimensional spherical 
coordinate system such that one of the angles is defined based on the cone axis. We can then 
directly i mpose the cone constrain t by limiting the corresponding angle in the random draw. In 
addition, Kothari and Meka ( 20I4I ) propose a method for generating pseudo-random variables from 
the spherical cap in logarithmic time. 


Remark 1 (Knowledge of input distribution only up to normalization factor). The computation of 
score function and the Fourier method both involve knowledge about input pdf p{x). However, we do 
not need to know the normalization factor, also known as partition function, of the input pdf. For 
the score function, it is immediately seen from the definition in dH) since the normalization factor 
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Procedure 3 Ridge regression method for estimating 02 and 62 

input Labeled samples {{xi,yi) : i E [n]}, estimates Ai and 61 , regularization parameter A. 

1 : Let hi := a{Ajxi + bi), i € [n], denote the estimation of the neurons. 

2 : Append each neuron hi by the dummy variable 1 to represent the bias, and thus, hi E 
3 : Let := i Yli£[n] ^ denote the empirical covariance of h. 

4 : Let E denote the estimated parameters by A-regularized ridge regression as 

/3a = (S/, + A4+i) • M E 2/*^* I ’ (10) 

^ \*6N / 

where Ik+i denotes the {k + l)-dimensional identity matrix. 

5: return 02 := /3a(1 : k), 62 := $\{k + 1). 


is canceled out by the division by p{x), and thus, the estimation of score function is at most as 
hard as estimation of input pdf up to normalization factor. In the Fourier method, we can use the 
non-normalized estimation of input pdf which leads to a normalization mismatch in the estimation 
of corresponding complex number. This is not a problem since we only use the phase information 
of these complex numbers. 

3.4 Ridge regression method 

For the neural network model in given a good estimation of neurons, we can estimate the 
parameters of last layer by linear regression. We provide Procedure [3] in which we use ridge 
regression algorithm to estimate the parameters of last layer 02 and 62 - See Appendix 1C.31 for the 
details of ridge regression and the corresponding analysis and guarantees. 

4 Risk Bound in the Realizable Setting 

In this section, we provide guarantees in the realizable setting, where the function f{x) := IE[y|x] 
is generated by a neural network as in ([7|). We provide the estimation error bound on the overall 
function recovery E 2 ;[|/(x) — /(x)p] when the estimation is done by Algorithm [TJ 

We provide guarantees in the following settings. 1) In the basic case, we consider the un¬ 
dercomplete regime k < d, and provide the results assuming Ai is full column rank. 2) In the 
second case, we form higher order cross-moments and tensorize it into a lower order tensor. This 
enables us to learn the network in the overcomplete regime k > d, when the Khatri-Rao product 
AiQ Ai E is full column rank. We call this the overcomplete setting and this can handle 

up to A; = 0{d‘^). Similarly, we can extend to larger k by tensorizing higher order moments in the 
expense of additional computational complexity. 

We define the following quantity for label function /(•) as 

Cf ■= [ f{x)dx. 

Note that in the binary classihcation setting {y E {0,1}), we have E[y|x] := f{x) E [0,1] which is 
always positive, and there is no square of f{x) considered in the above quantity. 
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Let rj denote the noise in the neural network model in d?]) such that the output is 

y = f{x) + V- 

Note that the noise rj is not necessarily independent of x; for instance, in the classification setting 
or binary output y E {0,1}, the noise in dependent on x. 


Conditions for Theorem [3} 

• Non-degeneracy of weight vectors: In the undercomplete setting {k < d), the weight 

matrix Ai E is full column rank and Smin(^i) > C; where Smin(') denotes the minimum 

singular value, and e > 0 is related to the target error in recovering the columns of Ai. In 
the overcomplete setting {k < d^), the Khatri-Rao product Ai 0 Ai E is full column 

ranI0, and Smin(^i © ^i) > e; see Remark [3] for generalization. 

• Conditions on nonlinear activating function cr( ): the coefficients 

\j := E [cr"'{zj)] ■ a 2 (j), Xj := E [cr"{zj)] ■ a 2 {j), j E [k], 

in (j20p and (j40p are nonzero. Here, z := Ajx+^ is the input to the nonlinear operator cr(-)- 
In addition, a{-) satishes the Lipschitz propertjo with constant L such that |(T(n) — (t{u')\ < 
L ■ \u — n'l, for u,u' E M. Suppose that the nonlinear activating function a{z) satisfies the 
property such that a{z) = l—a{—z). Many popular activating functions such as step function, 
sigmoid function and tanh function satisfy this last property. 

• Subgaussian noise: There exists a finite (Jnoise © 0 such that, almost surely, 

E^[exp(ar7)|x] < exp(a^(T^Qigg/2), Va E M, 


where rj denotes the noise in the output y. 

• Bounded statistical leverage: There exists a finite p\ > 1 such that, almost surely, 

Vk 

V^(inf{Aj} + A)A:i,a 

where ki^\ denotes the effective dimensions of the hidden layer h := a(Ajx + bi) as ki^x := 
Here, Aj’s denote the (positive) eigenvalues of the hidden layer covariance matrix 
S/i, and A is the regularization parameter of ridge regression. 


We now elaborate on these conditions. The non-degeneracy of weight veetors are required for the 
tensor decomposition analysis in the estimation of Hi. In the undercomplete setting, the algorithm 
first orthogonalizes (through whitening procedure) the tensor given in ([9]), and then decomposes 
it through tensor power iteration. Note that the convergence of power iteration for o rthogonal 


tensor decomposition is guaranteed (jZhang and Golubl . 120011 : lAnandkumar et al.l . l2014bl ) . For the 


■^It is shown in iBhaskara et ah! (l2013l i that this condition is satisfied under smoothed analysis. 

® If the step function ct{u) — li„>o}(rt) is used as the activating function, the Lipschitz property does not hold 
because of the non-continuity at it = 0. But, we can assume the Lipschitz property holds in the linear continuous 
part, i.e., when it, it' > 0 or it, it' < 0. We then argue that the input to the step function l{u>o}(it) is w.h.p. in the 
linear interval (where the Lipschitz property holds). 
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orthogonalization procedure to work, we need the tensor components (the columns of matrix ^i) 
to be linearly independent. In the overcomplete setting, the algorithm performs the same steps 
with the additional tensorizing procedure; see Appendix |B] for details. In this case, a higher order 
tensor is given to the algorithm and it is first tensorized before performing the same steps as in the 
undercomplete setting. Thus, the same conditions are now imposed on AiQ Ai. 

In addition to the non-degeneracy condition on weight matrix Ai, the coefficients condition on 
Aj’s is also required to ensure the corresponding rank-1 components in Q do not vanish, and thus, 
the tensor decomposition algorithm recovers them. Similarly, the coefficients \j should be also 
nonzero to enable us using the second order moment M 2 in (|39p in the whitening step of tensor 
decomposition algorithm. If one of the coefficients vanishes, we use the other option to perform the 
whitening; see Remark[5]and Procedure [5] for details. Note that the amount of non-linearity of (t{-) 
affects the magnitude of the coefficients. It is also worth mentioning that although we use the third 
derivative notation in characterizing the coefficients \j (and similarly in Aj), we do not 

need the differentiability of non-linear function (t(-) in all points. In particular, when input x is 
a continuous variable, we use Dirac delta function (5(-) as the derivative in non-continuous points; 
for instance, for the derivative of step function l{ 2 ;>o}(®)) we have ^l{ 2 ,>o}(a^) = d{x). Thus, in 
general, we only need the expectations E [(t"'(zj)] and E [o'''(zj)] to exist for these type of functions 
and the corresponding higher order derivatives. 

We impose the Lipschitz condition on the non-linear activating function to limit the error 
propagated in the hidden layer, when the first layer parameters are estimated by the neural network 
and Fourier methods. The condition a(z) = 1 — (t{—z) is also assumed to tackle the sign issue in 
recovering the columns of Ai; see Remark [6] for the details. The subgaussian noise and the bounded 
statistical leverage conditions are standard conditions, required for ridge regression, which is used 
for estimating the parameters of the second layer of the neural network. Both parameters (Tnoise; 
and px affect the sample complexity in the final guarantees. 

Imposing additional bounds on the parameters of the neural network are useful in learning these 
parameters with computationally efficient algorithms since it limits the searching space for training 
these parameters. In particular, for the Fourier analysis, we assume the following conditions. 
Suppose the columns of weight matrix Ai are normalized, i.e., ||(Ai)j|| = 1, j G [k], and the entries 
of first layer bias vector bi are bounded as |6i(0l < 1; ^ Note that the normalization condition 

on the columns of Ai is also needed for identifiability of the parameters. For instance, if the non¬ 
linear operator is the step function a{z) = then matrix Ai is only identifiable up to 

its norm, and thus, such normalization condition is required for identifiability. The estimation 
of entries of the bias vector bi is obtained from the phase of a complex number through Fourier 
analysis; see Procedure [2] for details. Since there is ambiguity in the phase of a complex numbei@, 
we impose the bounded assumption on the entries of bi to avoid this ambiguity. 

Let p{x) satisfy some mild regularity conditions on the boundaries of the support of p{x). In 
particular, all the entries of (matrix-output) functions 

/(x) • V(^)p(a:), V/(a:) • Vp(x)^, f{x) ■ p{x) (11) 


should go to zero on the boundaries of support of yjx). These regularit 

(1201 


conditions are required 
for the properties of the score function to hold; see lJanzamin et al.l (|2014l l for more details. 

In addition to the above main conditions, we also need some mild conditions which are not 
crucial for the recovery guarantees and are mostly assumed to simplify the presentation of the 


°A complex number does not change if an integer multiple of 27r is added to its phase. 
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main results. These conditions can be relaxed more. Suppose the input x is bounded, i.e., x £ Br, 
where Bj. := {a: : ||a:|| < r}. Assume the input probability density function p(x) > ip for some 
^jJ > 0, and for any x £ B^- The regularity conditions in (llip might seem contradictory with the 
lower bound condition p{x) > ip, but there is an easy fix for that. The lower bound on p{x) is 
required for the analysis of the Fourier part of the algorithm. We can have a continuous p{x), while 
in the Fourier part, we only use x’s such that p{x) > ip, and ignore the rest. This only introduces 
a probability factor Pr[x : p{x) > pj] in the analysis. 


Settings of algorithm in Theorem [3} 

• No. of iterations in Algorithm [7) N = Q (log i). 

• No. of initializations in Procedure [8) R > poly(A:). 

• Parameter ei = O (^) ™ Procedure [21 where n is the number of samples. 

• We exploit the empirical second order moment M 2 := ^ Vi ' S 2 {xi), in the whitening 

Procedure [5l which is the first option stated in the procedure. See Remark [5] for further 
discussion about the other option. 

Theorem 3 (NN-LIFT guarantees: estimation bound in the realizable setting). Assume the above 
settings and conditions hold. For e > 0, suppose the number of samples n satisfies (up to log 
factors ) 


n > O I ^ • E 


• poly ymax, 


M3{x)mJ (x) 

E[||52(x)57(x)||] 


( 12 ) 


Cf Xr 


c(^l 


(x)||] ’ ’ Amin’ Smin(^l) 


,m,L, 


l«2| 


( 02)1 


1 ^ 2 !, 


3 ! PX 


See ([32]) . (I55l) . (l571l and dMl) for the complete form of sample complexity. Here, M‘i{x) £ 
denotes the matricization of score function tensor Sz{x) £ see © for the definition of 

matricization. Furthermore, A min := min^gj;.] |Aj|, Amin := oiiUj-gj;,] |Aj|, Amax := |Aj|, 

(a 2 )min := min^gj;.] |a 2 (j)|, and ymax is such that |/(x)| < ymax? for x £ Br. Then the function 
estimate f{x) := {d 2 ,cr{Ajx + bi)) + 62 using the estimated parameters Ai, 61 , 02,62 (output of 
NN-LIFT Algorithmic satisfies the estimation error 

Ex[|/(x)-/(x)p]<d(e2). 


See Section [7] and Appendix ICl for the proof of theorem. Thus, we estimate the neural network 
in polynomial time and sample complexity. This is one of the first results to provide a guaranteed 
method for training neural networks with efficient comp utational and statistical complexity. Note 
that althoug h the sa r nple complexity in ( Barron . 1993l l is smaller as n > O(^), the proposed 
algorithm in ( Barron . 1993l l is not computationally efficient. 


Remark 2 (Sample complexity for Gaussian input). If the input x is Gaussian as x ^ M{0,ld), 
then we know t/iat E [||M 3 (x)M 3 '~(x)||] = O (df) and E [|| 52 (x) 52 ^(x)||] = 0(d^'), and the above 
sample complexity is simplified. 
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Remark 3 (Higher order tensorization). We stated that by tensorizing higher order tensors to lower 
order ones, we can estimate overcomplete models where the hidden layer dimension k is larger than 
the input dimension d. We can generalize this idea to higher order tensorizing such that m modes 
of the higher order tensor are tensorized into a single mode in the resulting lower order tensor. 
This enables us to estimate the models up to k = 0{d"^) assuming the matrix Ai Q ■ ■ ■ Q Ai (m 
Khatri-Rao products) is full column rank. This is possible with the higher computational complexity. 

Remark 4 (Effect of erroneous estimation of p{x)). The input probability density function p{x) is 
directly used in the Fourier part of the algorithm, and also indirectly used in the tensor decomposition 
part to compute the score function Sz{x); see ([8]). In the above analysis, to simplify the presentation, 
we assume we exactly know these functions, and thus, there is no additional error introduced by 
estimating them. It is straightforward to incorporate the corresponding errors in estimating input 
density into the final bound. 

Remark 5 (Alternative whitening prodecure). In whitening Procedure O two options are provided 
for constructing the second order moment M 2 . In the above analysis, we used the first option which 
exploits the second order score function. If any coefficient Xj,j G [k], in (l3^ vanishes, we cannot 
use the second order score function in the whitening procedure, and we use the other option for 
whitening; see Procedure\^ for the details. 


5 Risk Bound in the Non-realizable Setting 


In this section, we provide the risk bound for training the neural network with respect to an 
arbitrary target function; see Section 12.21 for the definition of the risk. 

In order to provide the risk bound with respect to an arbitrary target function, we also need to 
argue the approximation error in addition to the estimation error. For an arbitrary function f{x), 
we need to find a neural network whose error in approximating the function can be bounded. We 
then combine it with the estimation error in training that neural network. This yields the final risk 
bound. 

The approximation problem is about finding a neural network that approximates an arbitrary 
function f{x) with bounded error. Thus, this is different from the realiz able setting where there is a 
fixed neural network and we only analyze its estimation. Barron ( 19931 ) provides an approximation 
bound for the two-layer neural network and we exploit that here. His result is based on the Fourier 
properties of function f{x). Recall from ([5]) the definition of Fourier transform of f{x), denoted 
by F{u}), where u is called the frequency variable. Define the first absolute moment of the Fourier 
magnitude distribution as 

Cf:= [ \\u;\\ 2 -\F{u;)\duj. (13) 

BarronI ( 1993l i analyzes the approximation properties of 


fix)= a2ij)(T{{{Ai)j,x)+ bi{j)), ||(Ai)j|| = 1, |6i(j)| < 1, |a2(j)l < 2(7/,^ G [fc], (14) 

ie[A;] 

where the columns of weight matrix Ai are the normalized version of random frequencies drawn 
from the Fourier magnitude distribution |T(a;)| weighted by the norm of the frequency vector. More 
precisely, 


i.i.d. W 


OJd ^ 

^ c 


f 




UJn 


WUJn 


j G [k]. 


(15) 
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See Section [7.2.II for a detailed discussion on this connection between the columns of weight matrix 
Ai and the random frequency draws from the Fourier magnitude distribution, and see how this 
is argued in the proof of the approximation bound. The other parameters 02,61 need to be also 
found. He then shows the following approximation bound for (jl4[) . 

Theorem 4 (Approximation bound, Theorem 3 of Barron ( 1993I H. For a function f{x) with 
bounded Cf, there exists an approximation f{x) in the form of (fT4]l that satisfies the approximation 
bound 

IE.[|7(x) - /»p] < 0{r^C}) • (^ + <^1 

where f{x) = f{x) — /(O). Flere, for r > 0, 


5r-.= inf < 2 ^ + sup I(T(r 2 ;) - I 


(16) 


is a distance between the unit step function ^{z>o}i^) ^^6 scaled sigmoidal function a{Tz). 

See Barron ( 1993l i for the complete proof of the above theorem. For completeness, we have also 
reviewed the main ideas of this proof in Section 17.21 We now provide the formal statement of our 
risk bound. 

Conditions for Theorem [5} 

• The nonlinear activating function ct(-) is an arbitrary sigmoidal function satisfying the afore¬ 
mentioned Lipschitz condition. Note that a sigmoidal function is a bounded measurable 
function on the real line for which cr{z) —>■ 1 as 2 —)• 00 and a{z) —0 as z ^ — 00 . 

• For e > 0, suppose the number of samples n satisfies (up to log factors) 
k 


n> O 


•E 


Ms{x)mJ{x) 

( E[||52(x)57(x)| 

■ po y yyraazc, 


(17) 


Cf K 


1 


7 Amin 




.(.4,) 


(a2)r 


| 62 |,Crn 


) Px 


where C/ := J^d f{x)‘^dx; notice the difference with Note that this is the same sample 
complexity as in Theorem [3] with i/max substituted with i/max and Cf substituted with Cf- 
The target function f{x) is bounded, and for e > 0, it has bounded Cf as 

/ 1 \ 1 

(18) 


Cf<0 


.x/k 


+ 61 


x/k 


+ e 


E 


|53(x)ir 


poly 


E 

l|53(x)f' 

E 

l|52(x)f' 


'^min (^1) 1 1 (aa) r 


Amax’ W(7ll)’|^^d’^’ ll«2|| ’|^2|V„oiseVA 


See ([60]) and (l62]) for the complete form of bound on Cf. For Gaussian input x ~ 
we have Y^^Tii^53(7)]7l = O and r = 0{x/d). 

See Corollary [T] for examples of functions that satisfy this bound, and thus, we can learn them 
by the proposed method. 
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• The coefficients Xj := E \(T"'{zj)] ■ a 2 {j), and Xj := E \<j"{zj)] ■ a 2 (j), j G [k], in (l20l) and (H0]l 
are non-zero. 

• k random i.i.d. draws of frequencies in Equation (jiffill are linearly independent. Note that 
the draws are from Fourier magnitude distributioqj ||a;|| • |E(a;)|. For more discussions on 
this condition, see Section [7.2.11 and earlier explanations in this section. In the overcomplete 
regime, {k > d), the linear independence property needs to hold for appropriate tensorizations 
of the frequency draws. 

The above requirements on the number of samples n and parameter Cf depend on the pa¬ 
rameters of the neural network Aj, 02 , 61 and 62 - Note that there is also a dependence on these 
parameters through coefficients Xj and Xj. Since this is the non-realizable setting, these neural net¬ 
work parameters correspond to the neural networks that satisfy the approximation bound proposed 
in Theorem |4] and are generated via random draws from the frequency spectrum of the function 

fix)- 

The proposed bound on Cf in (jl 8 h is stricter when the number of hidden units k increases. This 
might seem counter-intuitive, since the approximation result in Theorem [5] suggests that increasing 
k leads to smaller approximation error. But, note that the approximation result in Theorem |4] does 
not consider efficient training of the neural network. The result in Theorem [5] also deals with the 
efficient estimation of the neural network. This imposes additional constraint on the parameter Cf 
such that when the number of neurons increases, the problem of learning the network weights is 
more challenging for the tensor method to resolve. 

Theorem 5 (NN-LIFT guarantees: risk bound). Suppose the above conditions hold. Then the 
target function f is approximated by the neural network f which is learnt using NN-LIFT in Algo¬ 
rithmic satisfying w.h.p. 

Ex[|/(x) - Kx)\^] < 0{r^Cj) • + 0{e% 

where dr is defined in dlSD. Recall X G Br, where Br := {x : ||x|| < r}. 

The theorem is mainly proved by combining the estimation bound guarantees in Theorem [Sj 
and the approximation bound results for neural networks provided in Theorem 01 But note that 
the approximation bound provided in Theorem 0] holds for a specific class of neural networks which 
are not generally recovered by the NN-LIFT algorithm. In addition, the estimation guarantees in 
Theorem [3] is for the realizable setting where the observations are the outputs of a fixed neural 
network, while in Theorem[5]we observe samples of arbitrary function f{x). Thus, the approxima¬ 
tion analysis in Theorem 0] can not be directly applied to Theorem [3l For this, we need additional 
assumptions to ensure the NN-LIFT algorithm recovers a neural network which is close to one of 
the neural networks that satisfy the approximation bound in Theorem 01 Therefore, we impose 
the bound on quantity Cf, and the full column rank assumption proposed in Theorem 01 See 
Appendix ini for the complete proof of Theorem [5l 

The above risk bound includes two terms. The first term 0{r'^C'j) ■ represents the 

approximation error on how the arbitrary function fix) with quantity Cf can be approximated 
by the neural network, whose weights are drawn from the Fourier magnitude distribution; see 

^Note that it should be normalized to be a probability distribution as in (nsi). 
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Theorem U] for the formal statement. From the definition of Cf in m, this bound is weaker 
when the Fourier spectrum of target f{x) has more energy in higher frequencies. This makes 
intuitive sense since it should be easier to approximate a function which is more smooth and has 
less fluctuations. The second term O(e^) is from estimation error for NN-LIFT algorithm, which is 
analyzed in Theorem [3l The polynomial f actors for samp le complexity in our estimation error are 
slightly worse than the bound provided in iBarro^ ( 19941 ). but note that we provide an estirnation 
method which is both computationally and statistically efficient, while the method in iBarronI (jl994l i 
is not computationally efficient. Thus, for the first time, we have a computationally efficient method 
with guaranteed risk bounds for training neural networks. 


Discussion on 5r in the approximation bound: The approximation bound involves a term 
5r which is a constant and does not shrink with increasing the neuron size k. Recall that 6r 
measures the distance between the unit step function the scaled sigmoidal function 

a{Tz) (which is used in the neural network specified in ([7])). We now provide the following two 
observations 

The above risk bound is only provided for the case r = 1. We can generalize this result 
by imposing different constraint on the norm of columns of Ai in (I14jl . In general, if we impose 

||(Ai)j|| = T,j € [k], for some r > 0, then we have the approximation bouncH + 5,-^ • 
Note that 5,- ^ 0 when r —>■ oo (the scaled sigmoidal function a{Tz) converges to the unit step 
function), and thus, this constant approximation error vanishes. 

If the sigmoidal function is the unit step function as a{z) = I{ 2 >o}(^)) then Sr = 0 for all r > 0, 
and hence, there is no such constant approximation error. 


6 Discussions and Extensions 

In this section, we provide additional discussions. We first propose a toy example contrasting the 
hardness of optimization problems backpropagation and tensor decomposition. We then discuss the 
generalization of learning guarantees to higher dimensional output, and also the continuous output 
case. We then discuss an alternative approach for estimating the low-dimensional parameters of 
the model. 

6.1 Contrasting the loss surface of backpropagation with tensor decomposition 

We discussed that the computational hardness of training a neural network is due to the non¬ 
convexity of the loss function, and thus, popular local search methods such as backpropagation 
can get stuck in spurious local optima. We now provide a toy example highlighting this issue, and 
contrast it with the tensor decomposition approach. 

We consider a simple binary classification task shown in Figure Ela, where blue and magneta 
data points correspond to two different classes. It is clear that these two classes can be classified by 
a mixture of two linear classifiers which are drawn as green solid lines in the figure. For this task, we 
consider a two-layer neural network with two hidden neurons. The loss surfaces for backpropagation 
and tensor decomposition are shown in Figures EJb andEJc, respectively. They are shown in terms 

®Note that this change also needs some straightforward appropriate modifications in the algorithm. 
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(b) Loss surface for backprop. 


(c) Loss surface for tensor method 


Figure 2: (a) Classification task: two colors correspond to binary labels. A two-layer neural network with 
two hidden neurons is used. Loss surface in terms of the first layer weights of one neuron (i.e., weights 
connecting the inputs to the neuron) is plotted while other parameters are fixed, (b) Loss surface for usual 
square loss objective has spurious local optima. In part (a), one of the spurious local optima is drawn as red 
dashed lines and the global optimum is drawn as green solid lines, (c) Loss surface for tensor factorization 
objective is free of spurious local optima. 


of the weight parameters of inputs to the first neuron, i.e., the first column of matrix Ai, while the 
weight parameters to the second neuron are randomly drawn, and then fixed. 

The stark contrast between the optimization landscape of tensor objective function, and the 
usual square loss objective used for backpropagation are observed, where even for a very simple 
classihcation task, backpropagation suffers from spurious local optima (one set of them is drawn as 
red dashed lines), which is not the case with tensor methods that is at least locally convex. This 
comparison highlights the algorithmic advantage of tensor decomposition compared to backpropa¬ 
gation in terms of the optimization they are performing. 

6.2 Extensions to cases beyond binary classification 

We earlier limited ourselves to the case where the output of neural network y G {0,1} is binary. 
These results can be easily extended to more complicated cases such as higher dimensional output 
(multi-label and multi-class), and also the continuous outputs (i.e., regression setting). In the rest 
of this section, we discuss about the necessary changes in the algorithm to adapt it for these cases. 
In the multi-dimensional case, the output label y is a vector generated as 

E[y\x] = Ala{Ajx + bi) + 62 , 

where the output is either discrete (multi-label and multi-class) or continuous. Recall that the al¬ 
gorithm includes three main parts: tensor decomposition, Fourier and ridge regression components. 

Tensor decomposition: For the tensor decomposition part, we first form the empirical version 
of T = E [y (g) 53 (x)]; note that 0 is used here (instead of scalar product used earlier) since y is not 
a scalar anymore. By the properties of score function, this tensor has decomposition form 

r = E [y (g) 53 (x)] = ^ E [a'''{zj)] ■ (^ 2 )-^ (g) {Ai)j (g) {Ai)j (g) {Ai)j, 
ie[fe] 
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where (^ 2 )'^ denotes the row of matrix A 2 . This is proved similar to Lemma [6l The tensor T is 
a fourth order tensor, and we contract the first mode by multiplying it with a random vector 0 as 
T(0, 1,1,1) leading to the same form in (jl9p as 

f{e, 1,1,1) ® (^i)i ® 

ie[A;] 

with Xj changed to Xj = E [a"'{zj)] -{{A 2 y, 0). Therefore, the same tensor decomposition guarantees 
in the binary case also hold here when the empirical version of T{0,I,I,I) is the input to the 
algorithm. 

Fourier method: Similar to the scalar case, we can use one of the entries of output to estimate 
the entries of hi. There is an additional difference in the continuous case. Suppose that the output 
is generated as y = f{x) + y where rj is noise vector which is independent of input x. In this case, 
the parameter corresponding to 1 “' entry of output yi is changed to Cj := fj^d f{x)‘fdx + rjfdt. 

Ridge regression: The ridge regression method and analysis can be immediately generalized to 
non-scalar output by applying the method independently to different entries of output vector to 
recover different columns of matrix A 2 and different entries of vector 62 - 


6.3 An alternative for estimating low-dimensional parameters 


Once we have an estimate of the first layer weights Ai, we can greedily (i.e., incrementally) add 
neurons with the weight vectors iAi)j for j G [k], and choose the bias 61 (j) through grid search, 
a nd learn its co ntribution a 2 {j) for its final output. This is on the lines of the method proposed 
Barron ( 1993l f . with one crucial difference tha t in our case, the first layer weights Ai are already 


m 


estimated by the tensor method. Barron ( 1993l f proposes optimizing for each weight vector {Ai)j 
in d-dimensional space, whose computational complexity can scale exponentially in d in the worst 
case. But, in our setup here, since we have already estimated the high-dimensional parameters 
(i.e., the columns of Ai), we only need to estimate a few low dimensional parameters. For the 
new hidden unit indexed by j, these parameters include the bias from input layer to the neuron 
(i.e., 61 (j)), and the weight from the neuron to the output (i.e., 02 (j)). This makes the approach 
computationally tractable, and we can even use brut e-force or exha ustive search to find the best 
parameters on a finite set and get guarantees akin to ( Barronl . 199^. 


7 Proof Sketch 

In this section, we provide key ideas for proving the main results in Theorems [3] and 01 

7.1 Estimation bound 

The estimation bound is proposed in Theorem[3l and the complete proof is provided in AppendixICl 
Recall that NN-LIFT algorithm includes a tensor decomposition part for estimating Ai, a Fourier 
technique for estimating bi , and a linear regression for estimating 02,62 • The application of linear 
regression in the last layer is immediately clear. In this section, we propose two main lemmas 
which clarify why the other methods are useful for estimating the unknown parameters Ai , bi 
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in the realizable setting, where the label y is generated by the neural network with the given 
architecture. 

In the following lemma, we show how the cross-moment between label and score function as 
E[y • 53 ( 3 :)] leads to a tensor decomposition form for estimating weight matrix Ai. 

Lemma 6. For the two-layer neural network specified in dZD, we have 


where {Ai)j E 


E [y ■ 53(x)] = ^ Aj • {Ai)j (g) {Ai)j (g) iAi)j, 

ie[fc] 

denotes the j-th column of Ai, and 

Xj = E [a'''{zj)] ■ a 2 {j), 


(19) 


( 20 ) 


for vector z := AJx bi as the input to the nonlinear operator (t(-). 


This is proved by the main property of score functions as yielding differential operators, where 
for label-function f{x) := E[7/|x], we have E[y • 53(x)] = E[vi?V(a^)] Manzamin et ah . 2014l h see 
Section IC.ll for a complete proof of the lemma. This lemma shows that by decomposing the cross¬ 
moment tensor E[y • 53(x)], we can recover the columns of Ai. 

We also exploit the phase of complex number v to estimate the bias vector bi ; see Procedure [2j 
The following lemma clarifies this. The perturbation analysis is provided in the appendix. 


Lemma 7. Let 


1 . ^ 

n ^ p{xi) 

i&[n\ 


( 21 ) 


Notice this is a realizable of v in Procedure where the output corresponds to a neural network y. 
If cvi’s are uniformly i.i.d. drawn from set Hi, then v has mean (which is computed over x, y and 
uj) 

"^>’’1 = ( i ) 

where denotes the surface area of d — 1 dimensional manifold Hi, and S(-) denotes the Fourier 
transform of a{-). 


This lemma is proved in Appendix 1C. 2 1 


7.2 Approximation bound 


We exploit the approximation bound argued in iBarronI (jl993l f provided in Theorem 01 We first dis¬ 
cuss his main result arguing an approximation bound 0{r‘^C‘j/k) for a function f{x) with bounded 
parameter Cf; see ()13p for the definition of Cf. Note that this corresponds t o the first term in 
the approximation error proposed in Theorem For this result, Barron ( 1993l f does not consider 
any bound on the parameters of first layer Ai and bi. He then provides a refinement of this result 
where he also bounds the parameters of neural network as we also do in (|14l) . This leads to the 
additional term involving 5r in the approximation error as seen in Theorem [H Note that bounding 
the parameters of neural network is also useful in learning these parameters with computation¬ 
ally efficient algorithms since it limits the searching space for training these parameters. We now 
provide the main ideas of proving these bounds as follows. 
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7.2.1 No bounds on the parameters of the neural network 


We first provide the proof outline when there is no additional constraints on the parameters of 
neural network; see set G defined in (1241). and coin pare it with the form we use in (|14p where there 
are additional bounds. In this case. Iharronl ( 1993l i argues approximation bound 0{r‘^C‘j/k) which 
is proved based on two main results. The first result says that if a function / is in the closure of 
the convex hull of a set G in a Hilbert space, then for every /c > 1, there is an as the convex 
combination of k points in G such that 




(23) 


for any constant d satisfying some lower bound related to the properties of set G and function /; 
see Lemma 1 in Barron ( 1993l i for the precise statement and the proof of this result. 

The second part of the proof is to argue that arbitrary function / G T (where T denotes the set 
of functions with bounded Gj) is in the closure of the convex hull of sigmoidal functions 


G := {7(T((a, x) + (5) : a G /3 G M, I 7 I < 2C}. 


(24) 


BarronI (jl993l i proves this result by arguing the following chain of inclusions as 


T C cl Geos C clGgtep C clG, 

where clG denotes the closure of set G, and set s Gms and Gs ten respectively denote set of some 
sinusoidal and step functions. See Theorem 2 in IBarronl (1993|) for the precise statement and the 
proof of this result. 

Random frequency draws from Fourier magnitude distribution: Recall from Section [5] 
that the columns of weight matrix Ai are the normalized version of random frequencies drawn from 
Fourier magnitude distribution ||u;|| • |F(a;)|; see Equation (fT^ . T his connection is along the proof 
of relation T C cl Geos that we recap here; see proof of Lemma 2 in BarronI ( 19931 ) for more details. 
By expanding the Fourier transform as magnitude and phase parts F{uj) = eF^^'>\F{oj)\, we have 


where 


f{x) := f{x) - /(O) = J g{x,uj)A{du}), 


A{uj) := llwll • \F{u;)\/Gf 


(25) 


(26) 


is the normalized Fourier magnitude distribution (as a probability distribution) weighted by the 
norm of frequency vector, and 


g{x, (jj) := (cos((a;, x) + 6{(jj)) — cos{9{u))). 

The integral in (I25p is an infinite convex combination of functions in the class 


Gpos .— 


_7_ 

|a;| 


(cos((a;, x) + /3) - cos(/3)) : cj / 0 , 17 I < G,/3 e 
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Now if cji,a; 2 ,..., Wfc is a random sample of k points independently drawn from Fourier magnitude 
distribution A, then by Fubini’s Theorem, we have 


E 




1 

k 


X] 9{x,u:j) 

je[fc] 


2 

^{dx) < —, 


where /i(-) is the probability measure for x. This shows function / is in the convex hull of Geos- 
Note that the bound ^ complies the bound in (1231) . 


7.2.2 Bounding the parameters of the neural network 

Barron ( 1993l i then imposes additional bounds on the weights of first layer, considering the following 


class of sigmoidal functions as 

Gr := {7(T(r((a,x) +/3)) : ||a|| < 1, |^| < 1, ly] < 2G}. 


(27) 


Note t hat the appro ximation proposed in (I14p is a convex combination of k points in (I27|) with 
r = 1. BarronI ( 1993l i concludes Theorem 0] by the following lemma. 

Lemma 8 (Lemma 5 in BarronI ( 1993I B. If g is a function on [—1,1] with derivative bounded by 
a constant G, then for every r > 0, we have 


inf sup \g{z) — gT{z)\ < 2G5r- 

S-rSclGr |5;|<T 


Finally Theorem 0] is proved by applying triangle inequality to bounds argued in the above two 


cases. 


8 Conclusion 

We have proposed a novel algorithm based on tensor decomposition for training two-layer neural 
networks. This is a computationally efficient method with guaranteed risk bounds with respect to 
the target function under polynomial sample complexity in the input and neuron dimensions. The 
tensor method is embarrassingly parallel and has a parallel time computational complexity which 
is logarithmic in input dimension which is comparable with parallel stochastic backpropagation. 
There are number of open problems to consider in future. Extending this framework to a multi-layer 
network is of great interest. Exploring the score function framework to train other discriminative 
models is also interesting. 
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A Tensor Notation 

In this Section, we provide the additional tensor notation required for the analysis provided in the 
supplementary material. 

Multilinear form: The multilinear form for a tensor T G ]R'JiX'?2xg3 jg j^gfij^ed as follows. Con¬ 
sider matrices Mr G G {1,2,3}. Then tensor T{Mi, M 2 , M^) G MPi^P 2 xp 3 jg (^gfing^j 

as 

r(Mi,M2,M3) := E E E 0 ® -^2(j2, 0 ® Ms{js ,:). (28) 

iie[iji]i26[i}2]i3e[ij3] 

As a simpler case, for vectors u,v,w G we have0 

T{I,v,w) := VjWiT{:,j,l) G M'^, (29) 

j,ie[d] 

which is a multilinear combination of the tensor mode-1 fibers. 


B Details of Tensor Decomposition Algorithm 


The goal of tensor decomposition algorithm is to recover the rank-1 components of tensor; refer 
to Equation ([3|) for the notion of ten sor rank and its rank-1 com ponents. We exploit the tensor 


decomposition algorithm proposed in ( Anandkumar et ah . 2ni4bl ^). Figure [3] depicts the flowchart 
of this method where the corresponding algorithms and procedures are also specified. Similarly, 
Algorithm |4] states the high-level steps of tensor decomposition algorithm. The main step of the 
tensor decomposition method is the tensor power iteration which is the generalization of matrix 
power iteration to 3rd order tensors. The tensor power iteration is given by 


u 


T{I,u,u) 

\\T{I,u,u) 


where u G M'^, T{I, u, u) := UjUiT{:,j, 1) G is a multilinear combination of tensor fibers. 

Note that tensor fibers are the vectors which are derived by fixing all the indices of the tensor 
except one of them, e.g., T{:,j,l) in the above expansion. The initialization for different runs of 

^'’Compare with the matrix case where for M € we have M(7, u) = Mu ~ j). 
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Input: Tensor T = X]jg[fc] 

i 

Whitening procedure (Procedure [S]) 

i 

SVD-based Initialization (Procedure [5]) 

i 

Tensor Power Method (Algorithm 0 ) 

4- 

Output: 


Figure 3: Overview of tensor decomposition algorithm for third order tensor (without tensorization). 


Algorithm 4 Tensor Decomposition Algorithm Setup 
input symmetric tensor T. 

1: if Whitening then 

2: Calculate T =Whiten(r); see Procedure El 

3: else if Tensorizing then 
4 : Tensorize the input tensor. 

5 : Calculate T =Whiten(T); see Procedure El 

6 : end if 

7 : for j = 1 to k do 

8: {vj,fij,T) = tensor power decomposition(T); see Algorithm 0 

9: end for 

10: {Ai)j = Un-whiten(uj), J G [A:]; see Procedure El 

11; return 


tensor power iteration is performed by the SVD-based technique proposed in Procedure El This 
helps to initialize non-convex tensor power iteration with good initialization vectors. The whitening 
preprocessing is applied to orthogonalize the components of input tensor. Note that the convergence 
guarantees of tensor power itera t ion for orthogonal tensor decomp osition have been developed in 


the literature (|Zhang and Golubl . l200ll : lAnandkumar et al.l . 12014 


mp c 

3 - 


The tensorization step works as follows. 


Tensorization: The tensorizing step is applied when we want to decompose overcomplete tensors 
where the rank k is larger than the dimension d. For instance, for getting rank up to A; = 0{d‘^), 
we first form the 6th order input tensor with decomposition as 

6 

T=Y^ G (g) M^. 

ie[A:] 

Given T, we form the 3rd order tensor T G which is the tensorization of T such that 

f[i2 + d{ii - l),j2 + d{ji - l),l2 + d{h - 1)) := T{ii,i2,ji,j2,h,h)- (30) 
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Procedure 5 Whitening 
input Tensor T E 

1 : Second order moment M 2 E is constructed such that it has the same decompositon form 
as target tensor T (see Section IC. 1.1 1 for more discussions): 

• Option 1: constructed using second order score function; see Equation ([3^ . 

• Option 2: computed as M 2 := T{I, /, 6) E where 9 ~ A^(0, Id) is a random standard 

Gaussian vector. 

2 : Compute the rank-k SVD, M2 = f/Diag( 7 )where U E and 7 E M^. 

3: Compute the whitening matrix W := C/Diag( 7 “^/^) E 
4: return T {W,W,W) . 


Procedure 6 Un-whitening 


uj ^ ji-i ^, j E [k]. 


input Orthogonal rank-1 components Vj E 
1 : Consider matrix M 2 which was exploited for whitening in Procedure EJ and let Xj,j E [A:] denote 
the corresponding coefficients as M 2 = Ai Diag(A) 2 l)''; see ([39]) . 

2 : Compute the rank-k SVD, M2 = UDiag( 7 )where U E R'^^^ and 7 E R^. 

3: Compute 

{Ai)j = ^Ut/Diag( 7 ^/^)uj, j E [k]. 


4: return {{Ai)j}.^^^y 


This leads to T having decomposition 

j&[k] 

We then apply the tensor decomposition algorithm to this new tensor T. This now clarifies why the 
full column rank condition is applied to the columns oi AQ A = [oi 0 oi • • • Ofc 0 a^]. Similarly, we 
can perform higher order tensorizations leading to more overcomplete models by exploiting initial 
higher order tensor T; see also Remark [3l 


Efficient implementation of tensor decomposition given samples: The main update steps 
in the tensor decomposition algorithm is the tensor power iteration for which a multilinear oper¬ 
ation is performed on tensor T. However, the tensor is not available beforehand, and needs to be 
estimated using the samples (as in Algorithm [1] in the main text). Computing and storing the 
tensor can be enormously expensive for high-dimensional problems. But, it is essential to note 
that since we can form a factor form of tensor T using the samples and other parameters in the 
model, we can manipulate the samples directly to perform the power update as multi-linear op- 
erati ons without explicitly form ing the tensor. This leads to efficient computational complexity. 


See (jAnandkumar et al.l . l2014al ) for details on these implicit update forms. 
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Algorithm 7 Robust tensor power method (!Anandkumar et al.. 2014bli 


input symmetric tensor T E ^d'xd'xd' ^ number of iterations N, number of initializations R. 
output the estimated eigenvector/eigenvalue pair; the deflated tensor. 

1: for r = 1 to i? do 

2: Initialize Vq'^^ with SVD-based method in Procedure El 

3: for t = 1 to A do 

4: Compute power iteration update 






V 


(t) 

t-1 


(31) 


5: end for 

6 : end for 

7: Let T* := argmax^g[K]{r(u^\uj^\u^^)}. 

8: Do N power iteration updates ([3T]l starting from to obtain v, and set fl := T{v,v,v). 
9: return the estimated eigenvector/eigenvalue pair the deflated tensor T — jl ■ 


Procedure 8 SVD-based initialization (Anandkurnar et ah, 12014c) 

input Tensor T € 

1: for r = 1 to log(l/h) do 

2: Draw a random standard Gaussian vector ~ AA(0, Id')- 

3: Compute as the top left singular vector of E . 

4: end for 
5: To ^ 

6: return vq. 


C Proof of Theorem [3] 

Proof of Theorem El includes three main pieces which is about arguing the recovery guarantees of 
three different parts of the algorithm: tensor decomposition, Fourier method, and linear regression. 
As the first piece, we show that the tensor decomposition algorithm for estimating weight matrix 
Al (see Algorithm [T] for the details) recovers it with the desired error. In the second part, we 
analyze the performance of Fourier technique for estimating bias vector bi (see Algorithm [1] and 
Procedure [2] for the details) proving the error in the recovery is small. Finally as the last step, the 
ridge regression is analyzed to ensure that the parameters of last layer of the neural network are 
well estimated leading to the estimation of overall function f{x). We now provide the analysis of 
these three parts. 


C.l Tensor decomposition guarantees 

We first provide a short proof for Lemma [6] which shows how the rank-1 components of third order 
tensor E [y ■ 53(x)] are the columns of weight matrix A i. _ 

Proof of Lemma [6l It is shown by Janzamin et al. ( 2014l l that the score function yields differ- 
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ential operator such that for label-function f(x) := E[7/|x], we have 

E[2/.53(x)]=E[Vi3)/(x)]. 

Applying this property to the form of label function /(x) in ([7]) denoted by f{x), we have 


E[y.53(x)]=E[c7'"(-)(a2,A^A^A^ )], 

where denotes the third order derivative of element-wise function (7{z) : —)• More 

concretely, with slightly abuse of notation, cr'"{z) G jg diagonal 4th order tensor with 

ite J-th diagonal entry eqnal to : M ^ R, Here two proportroa are nsed to oompnto the 

third order derivative f{x) on the R.H.S. of above equation as follows. 1) We apply chain rule 
to take the derivatives which generates a new factor of Ai for each derivative. Since we take 3rd 
order derivative, we have 3 factors of Ai. 2) The linearity of next layers leads to the derivatives 
from them being vanished, and thus, we only have the above term as the derivative. Expanding 
the above multilinear form finishes the proof; see (1281) for the definition of multilinear form. □ 
We now provide the recovery guarantees of weight matrix Ai through tensor decomposition as 
follows. 


Lemma 9. Among the conditions for Theorem consider the rank constraint on Ai, and the 
non-vanishing assumption on coefficients Xj’s. Let the whitening to he performed using empirical 
version of second order score function as specified in (l39]l . and assume the coefficients Xj’s do not 
vanish. Suppose the sample complexity 



.E 


•E 


Ms{x)mJ{x 
M^{x)mJ{x 


\4 

^max 


c(^l) 


■ ^min ' 


^min • Smin(^l) 




(32) 


holds, where Ms{x) G denotes the matricization of score function tensor 53(x) G 

see © for the definition of matricization. Then the estimate Ai by NN-LIFT Algorithmic satisfies 

w.h.p. 

niin ||(Ai)j - z • {Ai)j\\ < 0(ei), j G [k], 
ze{±i} 

where the recovery guarantee is up to the permutation of columns of Ai. 


Remark 6 (Sign ambiguity). We observe that in addition to the permutation ambiguity in the 
recovery guarantees, there is also a sign ambiguity issue in recovering the columns of matrix Ai 
through the decomposition of third order tensor in dUD. This is because the sign of {Ai)j and 
coefficient Xj can both change while the overall tensor is still fixed. Note that the coefficient Xj can 
be positive or negative. According to the Fourier method for estimating hi, mis-calculating the sign 
of {Ai)j also leads to sign of bi{j) recovered in the opposite manner. In other words, the recovered 
sign of the bias bi{j) is consistent with the recovered sign of {Ai)j. 
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Recall we assume that the nonlinear activating function a{z) satisfies the property such that 
a(z) = 1 — a{—z). Many popular activating functions such as step function, sigmoid function and 
tanh function satisfy this property. Given this property, the sign ambiguity in parameters Ai and 
bi whieh leads to opposite sign in input z to the aetivating function cr(-) can he now compensated 
by the sign of 02 and value ofb 2 , which is recovered through least squares. 


Proof of Lemma [9} From Lemma[6l we know that the exact cross-moment T = E[y • 53(x)] has 
rank-one components as columns of matrix Ai; see Equation (jl9h for the tensor decomposition form. 
We apply a tensor decomposition method in NN-LIFT to estimate th e columns of Ai. We employ 
noisy tensor decomposition guarantees in Anandkumar et ah ( 2ni4(j ). They show that when the 
perturbation tensor is small, the tensor power iteration initialized by the SVD-based Procedure [8] 
recovers the rank-1 components up to some small error. We also analyze the whitening step and 
combine it with this result leading to Lemma fTOl 

Let us now characterize the perturbation matrix and tensor. By Lemma [6l the CP decomposi¬ 
tion form is given by T = E[y • 53 (x)], and thus, the perturbation tensor is written as 


E := r - f = E[y • 53(x)] S:i{xi), 

n 

ie[n] 


(33) 


where T = ^ Z)je[n] Vi ' is the empirical form used in NN-LIFT Algorithm [TJ Notice that in 

the realizable setting, the neural network output y is observed and thus, it is used in forming the 
empirical tensor. Similarly, the perturbation of second order moment M 2 = E[y • 52(x)] is given by 

E 2 := M 2 - M 2 = E[y • 52 (x)] - - V y* • 52(xi). (34) 

n 

ie[n] 


In order to bound ||A||, we matricize it to apply matrix Bernstein’s inequality. We have the 
matricized version as 

E := E[y ■ M^ix)] - M^(xi) = ^ “ Vi ' Msixi)^, 

i&[n] ie[n] 


where M^{x) E is the matricization of 53(x) E see ([T]) for the definition of matri- 

cization. Now the norm of E can be bounded by the matrix Bernstein’s inequality. The norm of 
each (centered) random variable inside the summation is bounded as ^^^E[||M 3 (x)||], where ymax 
is the bound on \y\. The variance term is also bounded as 



^ E \yf • M3(x,)M3^(x, 


ie[n] 


1 -2 

— “l/max 


E 


M3ix)Mj{x) 


Applying matrix Bernstein’s inequality, we have w.h.p. 

||A|| < ||E|| <o(^^^E[||M3(x)M3T(x)||]) . (35) 

For the second order perturbation E 2 , it is already a matrix, and by applying matrix Bernstein’s 
inequality, we similarly argue that w.h.p. 

IIE 2 II < O (^^E[||52(x)52T(x)||]) . (36) 
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There i s one more remaini n g piec e to complete the proof of tensor decomposition part. The 
analysis in lAnandknmar et al.l (l2ni4dl does not involv e any w hitening step, and thus, we need to 
adapt the perturbation analysis of lAnandknmar et al.l (j2014g) to our additional whitening proce¬ 
dure. This is done in LemmafTOl In the final recovery bound (j46p in Lemma fTOl there are two terms; 
one involving ||iil||, and the other involving ||T' 2 ||- We first impose a bound on sample complexity 
such that the bound involving ||i?|| dominates the bound involving ||ii' 2 || as follows. Considering 
the bounds on ||i?|| and IILI 2 II in (1351) and (j36p . and imposing the lower bound on the number of 
samples (third bound stated in the lemma) as 


13/2 


n> O 


E[||52(x)52^(x) 
E[||M3(x)M3T(x)||]'/' 


1 


\2 

'^min 


.(^ 1 ) 


leads to this goal. By doing this, we do not need to impose the bound on ||E' 2 || anymore, and 
applying the perturbation bound in (j35p to the required bound on ||ill|| in Lemma [10] leads to 
sample complexity bound (second bound stated in the lemma) 


n> O 


•E 


Ms{x)mJ{x 


Ar 


1 


A. 


\2 . 

^min '^min 


(All 


■k 


Finally, applying the result of Lemma fTOl we have the column-wise error guarantees (up to permu¬ 
tation) 


||( Ai ),--( Ai ),||<0 


•^max (^ 1 ) AL 


2/max 


E[ M3{x)MJ{x) 


VAmin ^mfn ' Smin(^l) 


n 


<0{h), 


where in the first inequality we also substituted the bound on ||iT|| in (|35p . and the first bound on 
n stated in the lemma is used in the last inequality. □ 


C.1.1 Whitening analysis 

The perturbation analysis of proposed tensor decomposition met hod in Algorithmic with the corre¬ 
sponding SVD-based initialization in Procedure [8] is provided in Anandkumar et al. ( 2014 cI L But, 
they do not consider the effect of whitening proposed i n Proc edures [5] and |6l Thus, we need to 
adapt the perturbation analysis of lAnandknmar et al.l (j2014cl l when the whitening procedure is 
incorporated. We perform it in this section. 

We first elaborate on the whitening step, and analyze how the proposed Procedure [5] works. 
We then analyze the inversion of whitening operator showing how the components in the whitened 
space are translated back to the original space as stated in Procedure [H We finally provide the 
perturbation analysis of whitening step when estimations of moments are given. 


Whitening procednre: Consider second order moment M 2 which is used to whiten third order 
tensor 

^ • (^i)i ® (^i)i ® (^ 1 )/ (37) 

ie[fe] 
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in Procedure [3 It is constructed such that it has the same decomposition form as target tensor T, 
i.e., we have 

M 2 = ^ Aj • {A\)j (g) (38) 

is [A:] 

We propose two options for constructing M 2 in Procedure [3 First option is to use second order 
score function and construct M 2 := E [y • 52 (x)] for which we have 

M 2 := E [y • 52 (x)] = ^ Aj • {Ai)j ® (39) 

ie[fc] 

where 

A, =E[ct"(z,)] •a 2 (j), (40) 

for vector z := Alx + bi as the input to the nonlinear operator ct(-). This is proved similar 
to Lemma [3 Second option leads to the same form for M 2 as (|38p with coefficient modified as 

Let matrix W G denote the whitening matrix in the noiseless case, i.e., the whitening 

matrix W in Procedure [3 is constructed such that W~^M 2 W = Ik- Applying whitening matrix W 
to the noiseless tensor T = ^3 ' (^i)i ® ® have 

f[W,W,W) = Y. (»'‘"(.4i)_,)®‘‘ =Y^- = E ■ (“1) 

j&[k] ie[fc] M ^ ^ ie[fc] 

where we define 

~3/2 ’ ^3 ^ i ^ [^]> (42) 

in the last equality. Let V := [ui V 2 • • • t'fc] G denote the factor matrix for T(IF, IF, IF). We 
have 

F := IF^AiDiag(A^/ 2 )^ ( 43 ) 

and thus, 

FpT ^ Diag(A)A7IF = M 2 W = 4 . 

Since F is a square matrix, it is also concluded that F''~F = Ik, and therefore, tensor T(IF, IF, IF) is 
whitened such that the rank-1 components u^’s form an orthonormal basis. This discussion clarifies 
how the whitening procedure works. 

Inversion of the whitening procedure: Let us also analyze the inversion procedure on how 
to transform Uj’s to (Ai)j’s. The main step is stated in Procedure [3 According to whitening Pro¬ 
cedure 13 let M 2 = U Diag{'y)U~^, U G 7 G denote the rank-k SVD of M 2 - Substituting 

whitening matrix IF ;= t/Diag( 7 “^/^) in (j43p . and multiplying 17Diag(7^/^) from left, we have 

t/Diag(7^/2)F = UU^Ai Diag(A^/2). 

Since the column spans of Ai G and U G are the same (given their relations to M 2 ), 

Ai is a fixed point for the projection operator on the subspace spanned by the columns of U. 
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This projector operator is UU~^ (since columns of U form an orthonormal basis), and therefore, 
UU~^Ai = Ai. Applying this to the above equation, we have 


i.e., 


(^i)i = j G [A:]. 


(44) 


The above discussions describe the details of whitening and unwhitening procedures. We now 
provide the guarantees of tensor decomposition given noisy versions of moments M 2 and T. 


Lemma 10. Let M 2 = M 2 — E 2 and T = T — E respectively denote the noisy versions of 

M2 iAi)j <S> iAi)j, f=^Xj- {Ai)j <S> iAi)j <S> iAi)j. 

j&[k] je[k] 

Assume the second and third order perturbations satisfy the bounds 


(45) 


1/3 


:7/6 


\\E2\\<0\yj^: 


ll^ll <o\x. 


A 1/U 

^min 2 


Va; 
f ~x 


AM) 


A;V6 


1.5 






Then, the proposed tensor decomposition algorithm recovers estimations of rank-1 components 
{Ai)j’s satisfying error 


||(Ai),-(ii),|| <0 


c(^l) 


A„ 


\2 

^max 

VXmm 


\\E 2 f 


3 , 3.5 

^min 


®min(^l) A. 


I£^l 


1.5 


.(^ 1 ) 


, j G [k]. (46) 


Proof: We do not have access to the true matrix M 2 and the true tensor T, and the perturbed 

versions M 2 = M 2 — E 2 and T = T — E are used in the whitening procedure. Here, E 2 G 
denotes the perturbation matrix, and E G denotes the perturbation tensor. Similar to the 

noiseless case, let W G denotes the whitening matrix constructed by Procedure [5] such that 

W~^M 2 W = Ifc, and thus it orthogonalizes the noisy matrix M 2 . Applying the whitening matrix 
W to the tensor T, we have 

f (W, W, W) = f{W, W, W) - f{W -W,W -W,W -W) - E{W, W, W) 

ie[fc] 

where we used Equation (ITT]) , and we defined 

Ew ■.= f{W -W,W -W,W -W) + E{W,W,W) (48) 


as the perturbation tensor after whitening. Note that the perturbation is from two sources; one is 
from the error in computing whitening matrix reflected in VE — W, and the other is the error in 
tensor T reflected in E. 
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We know that the rank-1 components Vj’s form an orthonormal basis, a nd thus, we have a 


noisy o rthogonal tensor decomposition problem in (I47p . We apply the result of lAnandknmar et al 
( 2014cl i where they show that if 

II U, II ^ ^ 

W^wW <- jf —, 

aov k 

for some constant aQ> 1, then the tensor power iteration (applied to the whitened tensor) recovers 
the tensor rank-1 components with bounded error (up to the permutation of columns) 

||„ i; II < 6 (ILM) , (49) 

V limin J 

We now relate the norm of E\y to t he norm of origin al perturbations E and £' 2 . For the first 
, from Lemmata 4 and 5 of ISong et alJ (j2013l i. we have 

- ^ ^ ^ 64||£2f 


term in 


||r(w -w,w -w,w -w)\\ < 


1.5 . „6 fA 

^'min *^111111 

For the second term, by the sub-multiplicative property we have 
\\E{W,W,W)\\ < ||£|| • \\Wf < 8||£|| • ||Wf < 


8 ||£|| 




3/2 ■ 
min 


Here in the last inequality, we used 


< 


Sk{M2) 


x(^l)V^ 


where Sk{M 2 ) denotes the A:-th largest singular value of M 2 . Here, the equality is from the definition 
of W based on rank-/c SVD of M 2 in Procedure [5l and the inequality is from M 2 = Ai Diag(A)H)''. 
Substituting these bounds, we finally need the condition 


64||£2|P 


+ 


8 ||£|| 


AmitiVlogfc 

^mfn • 'Smin(^l) ' ^mfn ' 'SLn(^l) ~ «oA^Lv^ ’ 


< 


where we also substituted bound /Umin > Amin/A^lx) given Equation (|l2]l . The bounds stated in the 
lemma ensures that each of the te rms on the left hand side of the inequality are bounded by the right 
hand side. Thus, by the result of lAnandknmar et al.l (j2014cl L we have \\vj — Vj\\ < O (||£w||/limin)- 
On the other hand, by the unwhitening relationship in (|44p . we have 


||(Ai)j-(ii)j|| = -^||Diag(7^/2)-[uj-%]|| < 

a/Aa V '^min 


\'^j '^jW — ^max(-^l) 


'A, 


An 


(50) 

where in the equality, we use the fact that orthonormal matrix U preserves the ^2 norm, and the 
sub-multiplicative property is exploited in the first inequality. The last inequality is also from 
7max = Smax(M2) < Smax(^i) ' Amax, which is from M2 = Ai Diag(A)A7. Incorporating the error 
bound on \\vj — Vj\\ in ([49]), we have 

11 11 ^ ^ Q f Sma,x{Ai) A^ 

f^min I 


ll(Ali),-(ii) 11 ^ O I Smax (^i) 


'A, 


A. 


A. 








where we used the bound //min > Amin/Amix ™ the last step. 


□ 
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C.2 Fourier analysis guarantees 

The analysis of Fourier method for estimating parameter bi includes the following two lemmas. In 
the first lemma, we argue the mean of random variable v introduced in Algorithm [ 1 ] in the realizable 
setting. This clarifies why the phase of v is related to unknown parameter bi. In the second lemma, 
we argue the concentration of v around its mean leading to the sample complexity result. Note 
that V is denoted by v in the realizable setting. 

The Fourier method can be also used to estimate the weight vector 02 since it appears in the 
magnitude of complex number v. In this section, we also provide the analysis of estimating 02 with 
Fourier method which can be used as an alternative, while we primarily estimate 02 by the ridge 
regression analyzed in Appendix 1C.81 


Lemma [3 (Restated). Let 


V 


I 

n 


E 


p{xi) 


(51) 


Notice this is a realizable of v in Procedure where the output eorresponds to a neural network y. 
If tOi’s are uniformly i.i.d. drawn from set Hi, then v has mean (which is computed over x, y and 
u) 

where |II;| denotes the surface area of d—1 dimensional manifold Hi, and S(-) denotes the Fourier 
transform of a{-). 


Proof: Let F(uj) denote the Fourier trans form of label function f{x) := 

bi)) which is ( Marks II and Arabshahi . 1994l f 


c] = {a2,(T(Ajx + 


Fico) = 




j^[k] 


a2(j) , 


j2TTbl(j) 




LO- — 




M{d,j) 


M{\d,j) 


(53) 


where S(-) is the Fourier transform of fT(-), ul = [ui,U 2 , ■ ■ ■ is vector vJ with the last entry 

removed, Ai{\d,j) G is the j-th. column of matrix Ai with the d-th (last) entry removed, and 
finally 5{u) = 5 {ui)5{u2) ■ ■ ■ d{ud)- 

Let p{oj) denote the probability density function of frequency uj. We have 


]£[{}] — '^x,y,u. 

1 

... L 



= IEx,a; 


p{x) 

= 

1-1 

Ct) 

1 

{ui,x) 



f{x)e ^ 

‘^’'''>p{uj)dxduj 

= [ F{uj)p{uj)d 
J Q; 

LO, 
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where the second equality uses the law of total expectation, the third equality exploits the label¬ 
generating function definition /(x) := E[y|x], and the final equality is from the definition of Fourier 
transform. The variable u G is drawn from a d — 1 dimensional manifold Q,i C In order to 
compute the above integral, we define d dimensional set 




j 1 ,, ,, 1 I, , j N , I 1 — e?/2 

^ : X - X ^ ll'^ll < X + X) (^i)/)| > 


2 2 “ " " “ 2 2 '' ' ' '' “ 2 
for which fl; = limj,^o+ Assuming w’s are uniformly drawn from il-i-i/, we have 


E[u] = lim f F{u)p{u)di 
<^^0+ Jq,.„ 


w 


= lim 


IQllu 

1 


r* + OD 




F{u)lQ^.^{u)duj. 


The second equality is from uniform draws of oj from set such that p{uj) = py—where 
l 5 (-) denotes the indicator function for set S. Here, denotes the volume of d dimensional 

subspace for which in the limit i' —?■ O'*', we have |H;|, where |fl;| denotes the surface 

area of d — 1 dimensional manifold fl;. 

For small enough in the definition of only the delta function for j = I in the expansion 
of F{u}) in (j53p is survived from the above integral, and thus, 


E[i)] = lim 


1 /^ 0 + 


r+oo 

J—oo 




Q 2(0 , 

\A,{d,i)\ yA,{d,i) 




LO- — 




Ai{d,iy 


Ai{\d,l) ln^.^{u})du}. 


In order to simplify the notations, in the rest of the proof we denote /-th column of matrix Ai by 
vector a, i.e., a := (Ai);. Thus, the goal is to compute the integral 


r--|-0O 


I : = 


1 


|ad| 


S ( — 

ad 


UJd\ j2nbiil)^ 


Wd 


d6 (uj- - a- lQi.^(Lj)duj, 

ad 


and note that E[f}] = 02 (f) • limj,^o+ | • The rest of the proof is about computing the above 
integral. The integral involves delta functions where the final value is expected to be computed 
at a single point specified by the intersection of line a;_ = and sphere ||a;|| = ^ (when we 


consider the limit u O"*"). This is based on the following integration property of delta functions 
such that for function g(-) : M —)• M, 


r+cxD 


g{t)5{t)dt = 5 ( 0 ). 


(54) 


We first expand the delta function as follows. 
f +00 


I = 
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\<^d\ 

/* + CxD 


ad 
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ujd-i - ^^^Wd ) lni.,iuj)duj, 


^d\ j2iTbi{l)^ 


s ( — e- 

ad J 
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(w) • ^ {adiOd-i - ad-iujd) dwi • • • Wd, 
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where we used the property = S{Pt) in the second equality. Introducing new variable z, and 

; the change of vari 

J J— OO \^d 


applying the change of variable cod = -— (ad^Od-i — z), we have 

^d—1 


I = 




- e , 

dd \ UJl - Ud 

ad 


£ I Oid-2 

■ d UJd -2 -Wd 


• 6{z)(kji ■ ■■(kjd-i 


dz 

cud-i 


/ +00 

-OO 


—OO ad —1 


Wd-l \ .?27rbi(/)^^ / ai 

- e “d-i () cai- oJd-i 

ad-i J \ ad-i 




ad 

U}l,U} 2 , ■ ■ ■ ,UJd-l, -Wrf_l 

ad-1 


r I ad-2 

' 0 I OJd -2 - 

V ad-i 

dbJi ■ ■ ■ dbJd-l- 


For the sake of simplifying the mathematical notations, we did not substitute all the w^’s with z in 
the first equality, but note that all ujdS are implicitly a function of z which is finally considered in 
the second equality where the delta integration property in (|M|l is applied to variable z (note that 
„ _ n fUc como oc ^ Repeating the above process several times, we finally have 


/ = 
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Old 

Old-1 

1 

( ^1 

—s 

— 

ai 

V«i, 


+OD 

1 / X riTTblU) - . 


02 ad-1 ad 

CJl, - UJl, . . . , - UJl, - UJl 

Oil ai Oil 


duji- 


There is a line constraint as ^ = ^ = • • • = ^ in the argument of indicator function. This implies 
that llwll = where we used ||a|| = ||(^i);|| = 1. Incorporating this in the norm bound 


ai 


imposed by the definition of we have ^{1 — u) < uii < ^{1 + u), and hence. 


I = 
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We know E['i}] = 02(0 ' ]nT 


and thus. 


aij 


E[u] = 02(/) 


-|77T ■ “1^- 

v-lUil ai 








where in the first step we use \^i-u\ = ^ ' |^i|) and write the integral I in the limit —)• O"*". This 

finishes the proof. □ 

In the following lemma, we argue the concentration of v around its mean which leads to the 
sample complexity bound for estimating the parameter bi (and also 02) within the desired error. 


Lemma 11. If the sample complexity 


n> O \ log - 


(55) 


d 

holds for small enough €2 < Qp then the estimates 02 ( 1 ) = |s(^/2)| 1^1 > ^i(0 = 

for I G [k], in NN-LIFT AlgorithmU\ (see the definition ofv in (fSTIl ) satisfy with probability at least 

1 - 6 , 


\a2il) - 02(01 < 


m 


|S(1/2)| 


0(62), |6l(0 -^l( 0 l < 




7r|S(l/2)||o2(0l 


0 ( 62 ). 
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Proof: The result is proved by arguing the concentration of variable v in (|5ip around its mean 

characterized in ([52]). We use the Bernstein’s inequality to do this. Let v := X]ie[n] where 
~ lower bound p{x) > xjj assumed in Theorem [3| and labels y^’s being 

bounded, the magnitude of centered ViS {vi — are bounded by The variance term is 

also bounded as 


a = 


E (u* - E[i}i])(tij - E[i}j]) 


ie\n\ 


where u denotes the complex conjugate of complex number u. This is bounded as 




a 


is n 


is n 


Vi 


2 1 


P{Xi 


Since output y is a binary label (y € {0,1}), we have E[y^|x] = E[y|x] = /(x), and thus, 


E 



= E 


E 


p{x) 


= E 


hx) 

p{xY 



where the inequality uses the bound p{x) > '0 and the last equality is from definition of Qj. This 
provides us the bound on variance as 




ijjn 


Applying Bernstein’s inequality concludes the concentration bound such that with probability at 
least 1 — we have 


V — E[i}]| < O 


1 1 1 



< 0 ( 62 ), 


where the last inequality is from sample complexity bound. This implies that ||i}| — |E['i;]|| < 0 (e 2 ). 
Substituting |E[u]| from ([5^ and considering estimate a 2 {l) = |s(?/ 2 )| 1^1’ have 


| d2(0 - 02(01 < 


|E{1/2)|°(«)' 


which finishes the first part of the proof. For the phase, we have cj) := Zu — ZE[i}] = 7 r{bi{l) — bi{l)). 
On the other hand, for small enough error 62 (and thus small (f)), we have the approximation 
(j) ~ tan((/)) ~ (note that this is actually an upper bound such that 0 < tan((/))). Thus, 


|5i(0-6i(0l < 


1 

7 r|E[i;]| 


0 ( 62 ) < 


vr|S(l/ 2 )||a 2 ( 0 l 


0(6~2). 


This finishes the proof of second bound. 


□ 
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C.3 Ridge regression analysis and gnarantees 

Let h := cr{Ajx + bi) denote the neuron or hidden layer variable. With slightly abuse of notation, 
in the rest of analysis in this section, we append variable h by the dummy variable 1 to represent 
the bias, and thus, h G We write the output as y = hJ(3 + y, where 

/3 := [02,62] 

Given the estimated parameters of first layer denoted by Ai and 61 , the neurons are estimated 
as h := a{Ajx + 61 ). In addition, the dummy variable 1 is also appended, and thus, h G 
Because of this estimated encoding of neurons, we expand the output y as 

y = h^/3+ {h^ ~ = fih) +r], (56) 

bias (approximation): b{h) noise 


where f{h) := E[y|/i] = (3 + b{h). Here, we have a noisy linear model with additional bias 

(approximation). Let f3\ denote the ridge regression estimator for some regularization parameter 
A > 0, which is defined as the minimizer of the regularized empirical mean squared error, i.e., 


/3a 


arg min — 

H n 
P ie[n] 


)hi) iji 



We know this estimator is given by (when + A/ 0) 

h={\ + \iy^ -nhy), 

where := A hihj is the empirical covariance of /i, and E denotes the empirical mean 

operator. The analysis of ridge regression leads to the following expected prediction error (risk) 
bound on the estimation of the output. 

Lemma 12 (Expected prediction error of ridge regression). Suppose the parameter recovery results 
in Lemmata 0 and M on Ai and bi hold. In addition, assume the nonlinear activating function 
a{-) satisfies the Lipschitz property such that \(j{u) — cr(u')| < L ■ \u — u'\, for u,u' G M. The 
following noise, approximation and statistical leverage conditions also hold. Then, by choosing the 
optimal A > 0 TO the X-regularized ridge regression (which estimates the parameters 0,2 and 62 the 
estimated output as f{x) = d 2 0 '{Ajx + 61 ) + 62 satisfies the risk bound 

nm - mf] < o + o (E[i>('‘)"i+^L.,/2)) + 


where 


E[b{hf] < 


r + 




7r|S(l/2)||a2(/)| 


|2 r2 


L^kO{i^ 


Proof: Since h := ^{Ajx + 61 ), we equivalently argue the bound on K[{hJfix — /(fi))^], where 
f{x) = f{h) = hAfi\. From st andard results in the study of inverse problems, we know (see 
Proposition 5 in Hsu et all ( 20141 )1 


n{h' fix - f{h)Y]=n{h' fi - f{h)Y] + 
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Here, for positive definite matrix S 0, the vector norm || • ||s is defined as ||n||s := V v^T,v. For 
the first term, by the definition of f(h) as f(h) := IE[y|h] = (3 + 6(h), we have 

E[(r/?-/(h))2]=E[6(h)2]. 

Lemma [13] bounds E[6(h)^] and bounding ||/3 a — /3||s. is argued in Lemma [TH and Remark [7| 
Combining these bounds finishes the proof. □ 

In order to have final risk bounded as E[|/(x) — /(x)p] < O(e^), for some e > 0, the above 
lemma imposes sample complexity as (some of other parameters considered in (j32l) . (I55|) are not 
repeated here) 


n > O L 


-(1 + cr^oise) 


(57) 


Lemma 13 (Bounded approximation). Suppose the parameter recovery results in Lemmata\^and\TJ\ 
on Ai and bi hold. In addition, assume the nonlinear activating function a{-) satisfies the Lipschitz 
property such that \cr{u) — < L ■ \u — u'\, for u,u' G M. Then, the approximation term is 

bounded as 


E[6(h)"] < 


r + 


vr|S(l/2)||a2(0l 


'^L'^kO{e^ 


Proof: We have 


E[6(h)2] =E[(h-h,/3)2] < 


'E[||h-hf]. 


(58) 


Define e := max{ei,e 2 }, where ei and €2 are the corresponding bounds in Lemmata and 111! 
respectively. Using the Lipschitz property of nonlinear function <t(-), we have 

\hi - hi\ = \a{{{Ai)i,x) + bi{l)) - a{{{Ai)i,x) + 6i(0)| 


< L ■ 

< L ■ 


\{{A,fi-{A,)i,x)\ + Mi)-h{i)\ 


rO{e) + 


7r|S(l/2)||a2(/)| 


0(e) 


where in the second inequality, we use the bounds in Lemmata P and [TTl and bounded x such that 
||x|| < r. Applying this to (l58l) concludes the proof. □ 


We now assume the following additional conditions to bound ||/3 a — fiWy,, ■ The following dis- 

I I h 

cussions are along the results of lHsu et al.l (j2014l L 

We define the effective dimensions of the covariate h as 


kp,x := 


A, 


Xj + A 




where Aj’s denote the (positive) eigenvalues of and A is the regularization parameter of ridge 
regression. 

• Subgaussian noise: there exists a finite Unoise > 0 such that, almost surely, 

E^[exp(ar/)|h] < exp(aV^oise/2), Va G M, 
where rj denotes the noise in the output y. 
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Bounded statistical leverage: there exists a finite p\>l such that, almost surely, 

y/k 

< P\- 


V^(inf{Aj} + A)A:i,a 

• Bounded approximation error at A: there exists a finite -Bbias,A > 0 such that, almost surely, 

P\ “1“ V^||/3||^ < -Bbias,A) 

where \b{h)\ < Bmax- Note that the approximation term h{h) is bounded in Lemma fTHl The 
parameter -Bbias,A only contributes to the lower order terms in the analysis of ridge regression. 

Lemma 14 (Bounding excess mean squared error: Theorem 2 of Hsu et al. ( 2014I H. Fix some 
A > 0, and suppose the above noise, approximation and statistical leverage conditions hold, and in 
addition, 

n > d{p\ki^x)- (59) 

Then, we have 

|2 


WPx ~ /3|ls^ < O — ^E[6(/i)^] + o-'^oise/‘^'j + 2 


1 + ^ M 1 +o(l/n) 


n 


where E[6(/i)^] is bounded in Lemma\T^ 


In the above lemma, we also used the discussions in Remarks 12 and 15 of iHsu et ahl (j201J) 
which include comments on the simplification of the general result. 


Remark 7 (Optimal A). In addition, along the discussion in Remark 15 o uHsu et al\ 1201 A ), by 
choosing the optimal A > 0 that minimizes the bound in the above lemma, we have 


Wx-m<o^ 

h \ n 


+ 0 


n 


^1 + • 

J ' nois 


D Proof of Theorem [5] 

Before we provide the proof, we first state the details of bound on Cj. We require 

1 






-1 


e[||S3(i)I|"1 •'» 
, , 

- /'mi 


\2 

'^min \ 

• An 


4in(-4l) 

’max (^i) 


ei 


A„ 


1.5 




'E[l|S3(i)fl 
^[ 1152 ( 1 ) 11 = 13 /“ 


V -^max / 

Amin • 'Smin(^l) 


1 


(60) 


Proof of Theorem We first argue that the perturbation involves both estimation and 

approximation parts. 
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Perturbation decomposition into approximation and estimation parts: Similar to the 
estimation part analysis, we need to ensure the perturbation from exact means is small enough to 
apply the analysis of Lemmas [9] and [TTJ Here, in addition to the empirical estimation of quantities 
(estimation error), the approximation error also contributes to the perturbation. This is because 
there is no realizable setting here, and the observations are from an arbitrary function /(x). We 
address this for both the tensor decomposition and the Fourier parts as follows. 

Recall that we use notation f{x) (and y) to denote the output of a neural network. For 
arbitrary function f{x), we refer to the neural network satisfying the approximation error provided 
in Theorem 0] by yj. The ultimate goal of our analysis is to show that NN-LIFT recovers the 
parameters of this specific neural network with small error. More precisely, note that these are a 
class of neural networks satisfying the approximation bound in Theorem [Sj and it suffices to say 
that the output of the algorithm is close enough to one of them. 

Tensor decomposition: There are two perturbation sources in the tensor analysis. One is 
from the approximation part and the other is from the estimation part. By Lemma [U the CP 
decomposition form is given by Ty = E[yy • 53(x)], and thus, the perturbation tensor is written as 

E :=ff -f = E[yy • 53(x)] - -Vi ■ S^ixi), 

” isN 

where T = ^ X]ie[n] Vi ' is the empirical form used in NN-LIFT Algorithm [TJ Note that the 

observations are from the arbitrary function y = f{x). The perturbation tensor can be expanded 
as 

E = E[yf ■ 53(x)] - E[y ■ 53(x)] -hE[y • 53(x)] - -Vi ■ Ssixi), 

' ^ *e[n] 

•—£/apx. 

“-^est. 

where -Eapx. and -Eest. respectively denote the perturbations from approximation and estimation 
parts. 

We also desire to use the exact second order moment M 2 J = E[yf ■ 52(x)] for the whitening 
Procedure 0] in the tensor decomposition method. But, we have an empirical version for which the 
perturbation matrix E 2 := M 2 J — M 2 is expanded as 

E 2 = E[yf ■ 52(x)] -E[y • 52(x)] E[?/ • 52(x)] - - Y] 2/i •52(xj), 

*6[n] 

•—-^2,apx. ^ 

: = E2,est. 

where E' 2 ,apx. and E 2 ^est. respectively denote the perturbations from approximation and estimation 
parts. 

In Theorem [3] where there is no approximation error, we only need to analyze the estimation 
perturbations characterized in (f33]l and (1331) since the neural network output is directly observed 
(and thus, we use y to denote the output). Now, the goal is to argue that the norm of perturbations 
E and E 2 are small enough (see Lemma ITO]) . ensuring the tensor power iteration recovers the rank-1 
components of Ty = E[i/y • 53(x)] with bounded error. Again recall from Lemma[6]that the rank-1 
components of tensor Ty = E[yy • 53(x)] are the desired components to recover. 
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The estimation perturbations Sest. and -E 2 ,est. are similarly bounded as in Lemma [9] (see (I35p 
and (l36]l L and thus, we have w.h.p. 


||i5^2,est.|| < O (^Y^E[||52 (x)52T(x)||]) , 

where M^{x) G denotes the matricization of score function tensor S'i{x) G and ymax 

is the bound on \f{x)\ = \y\. 

The norm of approximation perturbation -Bapx. := ^[{Vf — v) ' *53(3:)] is bounded as 

||£^apx.|| = ||E[(y/ -y) •53 (x)]|| 

< E[||(y/ -y) •53 (x)||] 

= E[|y/-y| • ||53 (x)||] 

<(E||j;,-!,|2l.E|||S3(i)fl)‘'", 

where the first inequality is from the Jensen’s inequality applied to convex norm function, and we 
used Cauchy-Schwartz in the last inequality. Applying the approximation bound in Theorem HI we 
have 

||i^apx.|| < O(rCy) • + <Ji) • \/E[||53(x)f], (61) 

and similarly, 

||i?2,apx.|| < 0{rCf) • + Ji) • VE[||52(x)f], 

We now need to ensure the overall perturbations E = Egst. + E^apx. and E 2 = T^ 2 ,est. + T' 2 ,apx. 
satisfies the required bounds in Lemma [TOl Note that similar to what we do in Lemma [9l we hrst 
impose a bound such that the term involving ||ill|| is dominant in (1461) . Bounding the estimation 
part I Idlest. II provides similar sample complexity as in estimation Lemma [9] with ymax substituted 
by ymax- 

For the approximation error, by imposing (third bound stated in the theorem) 


C ^<0 



/I \-^ E[||53(x)f]V^ 
\Vk 7 E[||52(x)f]3/4 


■ -^r 


T.5 



we ensure that the term involving HFill is dominant in the final recovery error in (j46p . By doing 
this, we do not need to impose the bound on ||E' 2 ,apx.|| anymore, and applying the bound in (l6TTl 
to the required bound on ||F1|| in Lemma [10] leads to bound (second bound stated in the theorem) 


Cf<0 


1 / 1 


y/k 


+ <^1 


-1 


1.5 


IE[||53(x)|r 


•A, 




y/k 
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Finally, applying the result of Lemma [TOl we have the column-wise error guarantees (up to permu¬ 
tation) 




< o 


^max (^i) 


IE, 


est. I 


+ 11^. 


apx. I 


. VVin ^mfn ' 'SLn(^l) 

^max 'Smax(^l) 




2/max 

\fn 


E[\\M3ix)Mj {x)\\] 


+ rCr{-^^+s, 


IE[||53(x)||^ 


<o{h), 


where in the second inequality we substituted the earlier bounds on ||F'est. || and ||-Eapx. ||) and the 
first bounds on n and Cf stated in the theorem are used in the last inequality. 


Fourier part: Let 

^ :=iy 

n p{xi) 

i&[n\ 

Note that this a realization of v dehned in (|5ip when the output is generated by a neural network 
satisfying approximation error provided in Theorem 0] denoted hy yf, see the discussion in the 
beginning of the proof. 

The perturbation is now 

e := E[hd - - y . 

n y p{xi) 

i&[n\ 

'-V-' 

=:v 

Similar to the tensor decomposition part, it can be expanded to estimation and approximation 
parts as 

e := E[vf] — E[u] + E[u] — v . 

6apx. ^est. 

Similar to Lemma [TTl the estimation error is w.h.p. bounded as 

l^est. I ^ 0(62), 

if the sample complexity satisfies n > O , where Q := f{x)'^dx. Notice the difference 

between Q and Cf- The approximation part is also bounded as 

|eapx.| < ^E[\yf - y\] < ^^E[\yf-y\'^] < ^0{rCf) ■ + 61 ^ , 

where the last inequality is from the approximation bound in Theorem 01 Imposing the condition 

Cf • 0 {i;e2) (62) 

satisfies the desired bound |eapx. | < 0 {e2)- The rest of the analysis is the same as Lemma [TTl 
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Ridge regression: It introduces an additional approximation term in the linear regression for¬ 
mulated in (j56j). Given the above bounds on Cf, the new approximation term only contributes to 
lower order terms. 

Combining the analyzes for tensor decomposition, Fourier and ridge regression parts finishes 
the proof. □ 


D.l Discussion on Corollary [T] 

Similar to the specific Gaussian kernel function, we can also provide the results for other kernel 
functions and in general for positive definite f unctions as foll ows. f{x) is said to be positive definite 


BarronI ([l993j) shows that positive definite functions 


if I XiXjf{xj — xi) > 0, for all Xj,xi G M'^. 
have Cf bounded as 

C/< V-/(0)-V2/(0), 

where V^/(x) := Note that the operator is different from the derivative 

operator that we defined in (jH). Applying this to the proposed bound in IfTSl) . we conclude 
that our algorithm can train a neural network which approximates a class of positive definite and 
kernel functions with similar bounds as in Theorem[5l Corollary [T] is for the special case of Gaussian 
kernel function. 

Proo f of Corollar y [1} For the location and scale mixture f{x) := f K(a(x + j3))G(da, d/3), we 
have ( BarronI . 1993li Cf < Ck • / |ct| • \G\{da, d/3), where Ck denotes the correspond ing parameter 
for K{x). For the standard Gaussian kernel function K{x) considered here, we have ( BarronI . 1993l i 
Ck < yfd, which concludes 


Cf<Vd- 


j |a| • \G\{da,d^). 


We now apply the required bound in (fT8]l to finish the proof. But, in this specific setting, we also 
have the following simplifications for the bound in (IlSp . 

For the Gaussian input x ~ Af{0,a'^ld), the score function is 


Ssix) = 


tX — 


(J^ 




(x <S> ej (8* ej + ej (S> x <S> ej -1- ej <S> ej (8* x), 


ie[d] 


which has expected square norm as E[||53(x)|p] = 0 {d?ja^). 

Given the input is Gaussian and the activating function is the step function, we can also write 


the coefficients \j and \j as 


Aj = 02 (j) 
Aj = 02 (j) 


bi{j) 


■ exp 


• exp 


vrcj^ 


Mil! 

Mil! 

2 al 




Given the bounds on coefficients as \hi{j)\ < 1, |o 2 (j)| < 2(7/, j G [k], we have 

(^l)r 


Amin ^ (02) r 

Am ax 

A 


2Cf 


■exp{-l/{2aj). 


mm — 


(02)r 


y/^i 


Tiaz 


■ exp(-l/( 2 f 72 )) . min \bi{j)^/al - 1 

J6[fc] 
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Recall that the columns of Ai are randomly drawn from the Fourier spectrum of f{x) as de¬ 
scribed in (fT5]l . Given f{x) is the Gaussian kernel, the Fourier spectrum ||a ;||-|Ffa;)| corresponds to a 
sub-g aussain distribution. Thus, the singular values of Ai are bounded as (|Rudelson and Vershvninl. 


2009!) 


^min (^i) ^ 1 - \/Wd 


Ml) 


> 0 ( 1 ), 


1 -|- ^k/d 

where the last inequality is from k = Cd for some small enough C < 1. 

Substituting these bounds in the required bound in IfTSI) finishes the proof. 


□ 
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